Q-Learning


Today, we’re going to introduce one of the most well-known value-based reinforcement learning (RL) approaches: Q-Learning.
So, what is Q-Learning?

Q-Learning is a method where an agent learns to make decisions by estimating the value of taking certain actions in different states. It maintains a table—called a Q-table—that stores the expected future rewards for each state-action pair. As the agent explores the environment, it updates this table using feedback from its experiences, striking a balance between exploration (trying new actions) and exploitation (choosing the best-known action). This process continues until the values in the table converge.

However, Q-Learning is not suitable for environments with high-dimensional or continuous state spaces, as the Q-table becomes impractically large.

Q-Learning Introduction

Q-Learning is a model-free RL algorithm that learns an optimal action-value function, known as the Q-function, which estimates the expected cumulative reward for taking a specific action in a given state and then acting greedily (optimally) thereafter. Because the update bootstraps from the greedy action regardless of how the data was collected, Q-learning is off-policy. The update rule is defined as follows, and a minimal sketch of a single update appears after the definitions:

$$Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'}Q(s',a') - Q(s,a)]$$

where:

  • $s$: current state
  • $a$: action taken
  • $r$: immediate reward
  • $s'$: next state
  • $\alpha$: learning rate
  • $\gamma$: discount factor
  • $\max_{a'}Q(s',a')$: maximum Q-value for the next state over all possible actions
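
As a quick illustration, the update above can be written as a minimal NumPy sketch. The transition values below (state 0, action 2, reward 0.0, next state 4) are arbitrary placeholders for a 16-state, 4-action grid world:

import numpy as np

n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.95        # learning rate and discount factor
s, a, r, s_next = 0, 2, 0.0, 4  # placeholder transition (s, a, r, s')

# Move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a')
td_target = r + gamma * np.max(q_table[s_next])
q_table[s, a] += alpha * (td_target - q_table[s, a])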

Pros and Cons

Pros:

  • Model-Free: does not require knowledge of the environment's dynamics
  • Off-Policy: learns the greedy target policy while gathering experience with a different (exploratory) behavior policy
  • Simplicity: easy to understand compared to policy-based methods

Cons:

  • Scalability: the Q-table grows with the number of states and actions, so large or continuous state spaces quickly become impractical (addressed with neural-network function approximation, as in DQN)
  • No Generalization: lacks the ability to generalize across similar states or actions
  • Exploration-Exploitation: balance can be tricky and affects performance

Q-Learning Implementation

Here, we use a good old friend as our example: Frozen Lake, which is provided by the Gymnasium package.

import time
import argparse
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

class QLearningAgent:
    def __init__(self, env, 
      learning_rate=0.1, 
      discount_factor=0.95, 
      epsilon=1.0, 
      epsilon_decay=0.999, 
      epsilon_min=0.01):
        self.env = env
        self.lr = learning_rate  
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        self.q_table = np.zeros(
          (env.observation_space.n, env.action_space.n)
        )
        
    def _choose_action(self, state):
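        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily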
        if np.random.random() < self.epsilon:
            return self.env.action_space.sample()
        else:
            return np.argmax(self.q_table[state])
    
    def learn(self, state, action, reward, next_state, done):
        old_value = self.q_table[state, action]
        next_max = np.max(self.q_table[next_state])
        
        # Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a));
        # (1 - done) drops the bootstrap term on terminal transitions
        new_value = (1 - self.lr) * old_value \
          + self.lr * (reward + self.gamma * next_max * (1 - done))
        self.q_table[state, action] = new_value
        
        # Decay epsilon once per episode, never below epsilon_min
        if done:
            self.epsilon = max(
              self.epsilon_min, 
              self.epsilon * self.epsilon_decay
            )
    
    def train(self, episodes):
        rewards = []
        
        for e in range(1, episodes + 1):
            state, _ = self.env.reset()
            total_reward = 0
            done = False

            print(f"-------------------- Episode {e} --------------------")
            
            while not done:
                action = self._choose_action(state)
                next_state, reward, terminated, truncated, _ = \
                  self.env.step(action)
                done = terminated or truncated

                print(f"State: {state}, Action: {action}, Reward: {reward}, Next State: {next_state}, Done: {done}")
                
                self.learn(state, action, reward, next_state, done)
                
                state = next_state
                total_reward += reward
            
            rewards.append(total_reward)
            
            if e % 100 == 0:
                print(f"Episode {e}, Average Reward: {np.mean(rewards[-100:]):.2f}, Epsilon: {self.epsilon:.2f}")
        
        return rewards
    
    def play_episode(self):
        state, _ = self.env.reset()
        total_reward = 0
        done = False
        
        while not done:
            action = np.argmax(self.q_table[state])
            state, reward, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated
            total_reward += reward
            
            self.env.render()
            time.sleep(0.5)
        
        return total_reward

    def save_model(self, filename='q_table.npy'):
        np.save(filename, self.q_table)
        print(f"Model saved to {filename}")
    
    def load_model(self, filename='q_table.npy'):
        try:
            self.q_table = np.load(filename)
            print(f"Model loaded from {filename}")
            return True
        except FileNotFoundError:
            print(f"No saved model found at {filename}")
            return False

def plot_rewards(rewards):
    window_size = 50
    avg_rewards = []
    for i in range(0, len(rewards), window_size):
        window = rewards[i:i + window_size]
        if len(window) == window_size:
            avg_rewards.append(np.mean(window))
    
    plt.figure(figsize=(10, 5))
    plt.plot(range(
      window_size, len(rewards) + 1, window_size), 
      avg_rewards, 'b-')
    plt.title(f'Average Reward per {window_size} Episodes')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward')
    plt.grid(True)
    plt.show()

def get_env_info(env):
    print(f"Observation space number of states: {env.observation_space.n}")
    print(f"Action space number of actions: {env.action_space.n}")

def simulate(agent, episodes=5):
    print("\nPlaying episodes with trained agent:")

    for i in range(episodes):
        print(f"\nEpisode {i+1}")
        total_reward = agent.play_episode()
        print(f"Total reward: {total_reward}")

def main(args):
    if args.render:
        env = gym.make('FrozenLake-v1', render_mode="human")
    else:
        env = gym.make('FrozenLake-v1')

    get_env_info(env)

    agent = QLearningAgent(env)
    
    if args.render:
        # Try to load existing model
        if not agent.load_model(args.model_path):
            print("No saved model found. Please train the model first.")
            return
    else:
        # Train and save the model
        rewards = agent.train(episodes=args.episodes)
        agent.save_model(args.model_path)
        plot_rewards(rewards)

    if args.render:
        simulate(agent, args.render_episodes)
    env.close()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Q-Learning on FrozenLake')
    parser.add_argument('--render', 
      action='store_true', help='Render the environment')
    parser.add_argument('--episodes', 
      type=int, default=1000, help='Number of training episodes')
    parser.add_argument('--render_episodes', 
      type=int, default=5, help='Number of episodes to render')
    parser.add_argument('--model_path', 
      type=str, default='q_table.npy', help='Path to save/load the model')
    args = parser.parse_args()

    main(args)
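
Assuming the script above is saved as, say, q_learning.py (the filename is just for illustration), training and playback might be run as follows:

# Train, save the Q-table to q_table.npy, and plot the reward curve
python q_learning.py --episodes 5000

# Replay the learned greedy policy with rendering enabled
python q_learning.py --render --render_episodes 3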

Q-Learning - CSY

From the reward diagram, we can see that the average reward increases as the number of episodes grows.

Q-Learning Reward Curve - CSY

From the learning curve, we can observe that Q-learning gradually improves around episodes 3500 to 4000, but begins to oscillate afterward. This behavior suggests potential underlying issues, such as inadequate exploration, sparse rewards, or improperly tuned hyperparameters.
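
One way to probe the exploration hypothesis is to check how epsilon evolves under the schedule used above (start 1.0, decay 0.999 per episode, floor 0.01); a quick sketch:

import numpy as np

# Epsilon after each completed episode under the training schedule above
episodes = np.arange(1, 6001)
epsilon = np.maximum(0.01, 1.0 * 0.999 ** episodes)

# Epsilon only reaches its floor around episode ln(0.01)/ln(0.999) ~ 4603,
# so a few percent of actions are still random in the 3500-4000 range
print(int(np.argmax(epsilon <= 0.01)) + 1)

Since FrozenLake-v1 is slippery by default, even the greedy policy yields stochastic returns, which likely also contributes to the oscillation.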

Q-Learning Simulation - CSY

Conclusion

In this post, we discussed what kind of method Q-learning is — specifically, that it is a value-based and off-policy approach. We also covered its advantages and disadvantages, and implemented it on a simple grid-world environment: the Frozen Lake example.

However, as we know, Q-learning in its basic form is far from sufficient for real-world applications. In the next post, we’ll explore its more powerful successor — Deep Q Networks (DQN).
