Q-Learning


Today, we’re going to introduce one of the most well-known value-based reinforcement learning (RL) approaches: Q-Learning.
So, what is Q-Learning?

Q-Learning is a method where an agent learns to make decisions by estimating the value of taking certain actions in different states. It maintains a table—called a Q-table—that stores the expected future rewards for each state-action pair. As the agent explores the environment, it updates this table using feedback from its experiences, striking a balance between exploration (trying new actions) and exploitation (choosing the best-known action). This process continues until the values in the table converge.

However, Q-Learning is not suitable for environments with high-dimensional or continuous state spaces, as the Q-table becomes impractically large.

Q-Learning Introduction

Q-Learning is a model-free RL algorithm that learns an optimal action-value function, known as the Q-function, which estimates the expected cumulative reward for taking a specific action in a given state and then acting greedily (optimally) thereafter. Because the update bootstraps from the greedy action regardless of how the data was collected, Q-learning is off-policy. The update rule is defined as follows, and a minimal sketch of a single update appears after the definitions:

$$Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'}Q(s',a') - Q(s,a)]$$

where:

  • $s$: current state
  • $a$: action taken
  • $r$: immediate reward
  • $s'$: next state
  • $\alpha$: learning rate
  • $\gamma$: discount factor
  • $\max_{a'}Q(s',a')$: maximum Q-value for the next state over all possible actions
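
As a quick illustration, the update above can be written as a minimal NumPy sketch. The transition values below (state 0, action 2, reward 0.0, next state 4) are arbitrary placeholders for a 16-state, 4-action grid world:

import numpy as np

n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.95        # learning rate and discount factor
s, a, r, s_next = 0, 2, 0.0, 4  # placeholder transition (s, a, r, s')

# Move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a')
td_target = r + gamma * np.max(q_table[s_next])
q_table[s, a] += alpha * (td_target - q_table[s, a])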

Pros and Cons

Pros:

  • Model-Free: does not require knowledge of the environment's dynamics
  • Off-Policy: learns the greedy target policy while gathering experience with a different (exploratory) behavior policy
  • Simplicity: easy to understand compared to policy-based methods

Cons:

  • Scalability: the Q-table grows with the number of states and actions, so large or continuous state spaces quickly become impractical (addressed with neural-network function approximation, as in DQN)
  • No Generalization: lacks the ability to generalize across similar states or actions
  • Exploration-Exploitation: balance can be tricky and affects performance

Q-Learning Implementation

Here, we use a good old friend as our example: Frozen Lake, which is provided by the Gymnasium package.

import time
import argparse
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

class QLearningAgent:
    def __init__(self, env, 
      learning_rate=0.1, 
      discount_factor=0.95, 
      epsilon=1.0, 
      epsilon_decay=0.999, 
      epsilon_min=0.01):
        self.env = env
        self.lr = learning_rate  
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        self.q_table = np.zeros(
          (env.observation_space.n, env.action_space.n)
        )
        
    def _choose_action(self, state):
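        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily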
        if np.random.random() < self.epsilon:
            return self.env.action_space.sample()
        else:
            return np.argmax(self.q_table[state])
    
    def learn(self, state, action, reward, next_state, done):
        old_value = self.q_table[state, action]
        next_max = np.max(self.q_table[next_state])
        
        # Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a));
        # (1 - done) drops the bootstrap term on terminal transitions
        new_value = (1 - self.lr) * old_value \
          + self.lr * (reward + self.gamma * next_max * (1 - done))
        self.q_table[state, action] = new_value
        
        # Decay epsilon once per episode, never below epsilon_min
        if done:
            self.epsilon = max(
              self.epsilon_min, 
              self.epsilon * self.epsilon_decay
            )
    
    def train(self, episodes):
        rewards = []
        
        for e in range(1, episodes + 1):
            state, _ = self.env.reset()
            total_reward = 0
            done = False

            print(f"-------------------- Episode {e} --------------------")
            
            while not done:
                action = self._choose_action(state)
                next_state, reward, terminated, truncated, _ = \
                  self.env.step(action)
                done = terminated or truncated

                print(f"State: {state}, Action: {action}, Reward: {reward}, Next State: {next_state}, Done: {done}")
                
                self.learn(state, action, reward, next_state, done)
                
                state = next_state
                total_reward += reward
            
            rewards.append(total_reward)
            
            if e % 100 == 0:
                print(f"Episode {e}, Average Reward: {np.mean(rewards[-100:]):.2f}, Epsilon: {self.epsilon:.2f}")
        
        return rewards
    
    def play_episode(self):
        state, _ = self.env.reset()
        total_reward = 0
        done = False
        
        while not done:
            action = np.argmax(self.q_table[state])
            state, reward, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated
            total_reward += reward
            
            self.env.render()
            time.sleep(0.5)
        
        return total_reward

    def save_model(self, filename='q_table.npy'):
        np.save(filename, self.q_table)
        print(f"Model saved to {filename}")
    
    def load_model(self, filename='q_table.npy'):
        try:
            self.q_table = np.load(filename)
            print(f"Model loaded from {filename}")
            return True
        except FileNotFoundError:
            print(f"No saved model found at {filename}")
            return False

def plot_rewards(rewards):
    window_size = 50
    avg_rewards = []
    for i in range(0, len(rewards), window_size):
        window = rewards[i:i + window_size]
        if len(window) == window_size:
            avg_rewards.append(np.mean(window))
    
    plt.figure(figsize=(10, 5))
    plt.plot(range(
      window_size, len(rewards) + 1, window_size), 
      avg_rewards, 'b-')
    plt.title(f'Average Reward per {window_size} Episodes')
    plt.xlabel('Episode')
    plt.ylabel('Average Reward')
    plt.grid(True)
    plt.show()

def get_env_info(env):
    print(f"Observation space number of states: {env.observation_space.n}")
    print(f"Action space number of actions: {env.action_space.n}")

def simulate(agent, episodes=5):
    print("\nPlaying episodes with trained agent:")

    for i in range(episodes):
        print(f"\nEpisode {i+1}")
        total_reward = agent.play_episode()
        print(f"Total reward: {total_reward}")

def main(args):
    if args.render:
        env = gym.make('FrozenLake-v1', render_mode="human")
    else:
        env = gym.make('FrozenLake-v1')

    get_env_info(env)

    agent = QLearningAgent(env)
    
    if args.render:
        # Try to load existing model
        if not agent.load_model(args.model_path):
            print("No saved model found. Please train the model first.")
            return
    else:
        # Train and save the model
        rewards = agent.train(episodes=args.episodes)
        agent.save_model(args.model_path)
        plot_rewards(rewards)

    if args.render:
        simulate(agent, args.render_episodes)
    env.close()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Q-Learning on FrozenLake')
    parser.add_argument('--render', 
      action='store_true', help='Render the environment')
    parser.add_argument('--episodes', 
      type=int, default=1000, help='Number of training episodes')
    parser.add_argument('--render_episodes', 
      type=int, default=5, help='Number of episodes to render')
    parser.add_argument('--model_path', 
      type=str, default='q_table.npy', help='Path to save/load the model')
    args = parser.parse_args()

    main(args)
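
Assuming the script above is saved as, say, q_learning.py (the filename is just for illustration), training and playback might be run as follows:

# Train, save the Q-table to q_table.npy, and plot the reward curve
python q_learning.py --episodes 5000

# Replay the learned greedy policy with rendering enabled
python q_learning.py --render --render_episodes 3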

Q-Learning - CSY

From the reward diagram, we can see that the average reward increases as the number of episodes grows.

Q-Learning Reward Curve - CSY

From the learning curve, we can observe that Q-learning gradually improves around episodes 3500 to 4000, but begins to oscillate afterward. This behavior suggests potential underlying issues, such as inadequate exploration, sparse rewards, or improperly tuned hyperparameters.
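
One way to probe the exploration hypothesis is to check how epsilon evolves under the schedule used above (start 1.0, decay 0.999 per episode, floor 0.01); a quick sketch:

import numpy as np

# Epsilon after each completed episode under the training schedule above
episodes = np.arange(1, 6001)
epsilon = np.maximum(0.01, 1.0 * 0.999 ** episodes)

# Epsilon only reaches its floor around episode ln(0.01)/ln(0.999) ~ 4603,
# so a few percent of actions are still random in the 3500-4000 range
print(int(np.argmax(epsilon <= 0.01)) + 1)

Since FrozenLake-v1 is slippery by default, even the greedy policy yields stochastic returns, which likely also contributes to the oscillation.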

Q-Learning Simulation - CSY

Conclusion

In this post, we discussed what kind of method Q-learning is — specifically, that it is a value-based and off-policy approach. We also covered its advantages and disadvantages, and implemented it on a simple grid-world environment: the Frozen Lake example.

However, as we know, Q-learning in its basic form is far from sufficient for real-world applications. In the next post, we’ll explore its more powerful successor — Deep Q Networks (DQN).
