
Module 4 — Deep Q-networks

DQN hyperparameters & debugging

Learning rate, buffer size, target sync, reward scaling, and common failure modes.

~60 min read + exercises

Before we begin

A DQN that never beats random on CartPole is almost always a hyperparameter, logging, or environment bug — not a fundamental failure of deep RL. This lesson gives a systematic debug checklist and sensible search ranges so you spend time learning, not guessing.


Learning objectives

  • Set up a minimal experiment log: return, loss, ε, Q magnitudes.
  • Diagnose failure modes from learning curves.
  • Tune learning rate, buffer size, target frequency, and exploration schedule.
  • Run a one-at-a-time ablation on CartPole.
  • Know when to stop tuning DQN and try PPO instead.

Debug dashboard — what to log every N steps

| Metric | Healthy trend | Red flag |
| --- | --- | --- |
| Episode return (rolling 100) | Rises, plateaus near max | Flat at random baseline |
| TD loss | Falls then stabilizes | Monotonic explosion |
| Mean Q | Grows then stabilizes | NaN or 1e6+ |
| ε | Decays per schedule | Stuck at 1.0 forever |
| Buffer size | Reaches capacity | Stays near zero |

python
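import matplotlib.pyplot as plt
import numpy as np


def plot_returns(returns, window=50):
    """Plot per-episode returns, smoothed with a moving average when possible."""
    r = np.asarray(returns, dtype=float)
    if len(r) < window:
        # Too few episodes to smooth: plot the raw returns.
        plt.plot(r)
    else:
        # mode="valid" drops the first window - 1 points, so shift the x-axis
        # to label each average with the last episode in its window.
        kernel = np.ones(window) / window
        smoothed = np.convolve(r, kernel, mode="valid")
        plt.plot(np.arange(window - 1, len(r)), smoothed)
    plt.xlabel("episode")
    plt.ylabel("return")
    plt.title("DQN learning curve")
    plt.show()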
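
The red flags in the dashboard table can also be checked automatically at each logging step. The following is a minimal sketch, not a definitive monitor: the dictionary keys, the `random_baseline`, `train_start`, and `grace_steps` parameters, and their default values are all assumptions chosen for illustration.

```python
import math


def check_metrics(m, random_baseline=25.0, q_limit=1e6,
                  train_start=5_000, grace_steps=50_000):
    """Return a list of red-flag messages for one snapshot of logged metrics.

    `m` is a dict with hypothetical keys: rolling_return, td_loss, mean_q,
    epsilon, buffer_size, step. All thresholds are illustrative assumptions.
    """
    flags = []
    if math.isnan(m["mean_q"]) or abs(m["mean_q"]) >= q_limit:
        flags.append("mean Q is NaN or exploding")
    if math.isnan(m["td_loss"]):
        flags.append("TD loss is NaN")
    if m["step"] > train_start:
        # Once training has started, epsilon should be decaying and the
        # buffer should hold at least the warmup transitions.
        if m["epsilon"] >= 1.0:
            flags.append("epsilon stuck at 1.0")
        if m["buffer_size"] < train_start:
            flags.append("replay buffer is not filling")
    if m["step"] > grace_steps and m["rolling_return"] <= random_baseline:
        flags.append("return still at random baseline")
    return flags
```

Calling `check_metrics` on every logged snapshot and printing any non-empty result surfaces these failures without waiting for someone to look at a plot.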

Worked example — reading a bad run

Symptoms: return ~20 (random CartPole ~20–30), loss decreasing, mean Q increasing slowly.

| Hypothesis | Test | Fix |
| --- | --- | --- |
| Insufficient exploration | Check ε schedule | Slower decay or higher final ε |
| Target updates too rare | Log target age (steps since last sync) | Decrease the target sync interval C to 1000 |
| LR too low | Ablation with doubled lr | Try 5e-4 |
| Wrong reward scaling | Print raw rewards | Normalize if needed |
| Bug in done handling | Log episode length | Fix terminal bootstrap |
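
To test the exploration hypothesis in the table, it helps to print the ε schedule at a few frame counts and compare it with what the agent actually logs. This is a minimal sketch of a linear schedule. The function name and the default values are assumptions for illustration, not settings the lesson prescribes.

```python
def linear_epsilon(frame, eps_start=1.0, eps_end=0.05, eps_decay_frames=50_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold at eps_end."""
    # Clamp progress to [0, 1] so epsilon never drops below eps_end.
    fraction = min(max(frame / eps_decay_frames, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


# Print the schedule at a few checkpoints to compare against logged values.
for frame in (0, 25_000, 50_000, 100_000):
    print(frame, round(linear_epsilon(frame), 3))
```

If the logged ε does not match the schedule at the same frame count, the agent is probably advancing the schedule with the wrong counter (for example episodes instead of frames).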

Often two issues combine. Fix the done-flag handling first, because a correctness bug invalidates any tuning done on top of it, then retune ε.
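
The done-handling bug usually lives in the target computation. This is a sketch of the one-step DQN target, r + γ · max Q_target(s', a), with the bootstrap term switched off only for terminal transitions. It uses plain Python lists for clarity. The function name and argument layout are assumptions, and a real agent would do the same arithmetic on tensors.

```python
def td_targets(rewards, next_q_max, terminated, gamma=0.99):
    """One-step TD targets with no bootstrap on terminal transitions.

    next_q_max[i] is max_a Q_target(s'_i, a). terminated[i] must be the
    environment's `terminated` flag only: a time-limit truncation should
    keep its bootstrap term, so do not pass `terminated or truncated`.
    """
    return [
        r + gamma * q * (0.0 if term else 1.0)
        for r, q, term in zip(rewards, next_q_max, terminated)
    ]


targets = td_targets(
    rewards=[1.0, 1.0, 1.0],
    next_q_max=[10.0, 10.0, 10.0],
    terminated=[False, True, False],
)
# The middle transition is terminal, so its target is just the reward.
```

A quick test is to feed in a terminal transition and confirm the target equals the raw reward. If it does not, the mask is missing or inverted.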

Hyperparameter search order

Tune in this sequence (one change at a time):

  1. Correctness — replay stores right tuples, train only after warmup, gamma matches env discount.
  2. Learning rate — sweep 1e-5, 3e-4, 1e-3 on short 50-episode runs.
  3. Target update frequency — 500 vs 2000 vs 10000 gradient steps.
  4. Buffer size — 10k vs 100k (CartPole); 1M for Atari.
  5. Exploration — linear ε decay over total frames.
  6. Network size — 64 vs 128 hidden; deeper rarely helps CartPole.
python
config = dict(
    lr=2.5e-4,
    gamma=0.99,
    batch_size=64,
    buffer_size=100_000,
    target_update=1000,
    eps_start=1.0,
    eps_end=0.05,
    eps_decay_frames=50_000,
    train_start=5_000,
    grad_clip=10.0,
)
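
One-change-at-a-time tuning can be made mechanical by generating every variant from a fixed baseline. This is a hedged sketch: `one_at_a_time` is a hypothetical helper, and the sweep values are copied from the search order above rather than being recommendations of their own.

```python
def one_at_a_time(base, sweeps):
    """Yield (label, config) pairs that differ from `base` in at most one key."""
    yield "baseline", dict(base)
    for key, values in sweeps.items():
        for value in values:
            if value == base[key]:
                continue  # identical to the baseline run, so skip it
            variant = dict(base)  # copy so the baseline is never mutated
            variant[key] = value
            yield f"{key}={value}", variant


base = dict(lr=2.5e-4, target_update=1000, buffer_size=100_000)
sweeps = dict(
    lr=[1e-5, 3e-4, 1e-3],
    target_update=[500, 2000, 10_000],
    buffer_size=[10_000, 100_000],
)
runs = list(one_at_a_time(base, sweeps))
for label, cfg in runs:
    print(label, cfg)
```

Because every variant shares the baseline for all other keys, any change in return can be attributed to the single hyperparameter named in the label.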

Environment-specific notes

| Environment | γ | Replay | Train start | Notes |
| --- | --- | --- | --- | --- |
| CartPole-v1 | 0.99 | 50k | 1k | Solve ~475 over 100 eps |
| MountainCar | 0.99 | 100k | 10k | Sparse reward; PER helps |
| Atari | 0.99 | 1M | 50k | Frame stack 4, grayscale |

CartPole episodes are short (CartPole-v1 truncates at 500 steps), so learning depends on seeing many episodes rather than on a long horizon within each one.
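
The table can be kept in code as per-environment overrides on a shared base config. This is a sketch under assumptions: `ENV_PRESETS` and `make_config` are hypothetical names, and the keys "MountainCar" and "Atari" are the table's labels, not exact Gymnasium environment IDs.

```python
# Overrides taken from the table above; keys are table labels, not env IDs.
ENV_PRESETS = {
    "CartPole-v1": dict(gamma=0.99, buffer_size=50_000, train_start=1_000),
    "MountainCar": dict(gamma=0.99, buffer_size=100_000, train_start=10_000),
    "Atari": dict(gamma=0.99, buffer_size=1_000_000, train_start=50_000),
}


def make_config(env_name, base):
    """Return a copy of `base` with the environment's overrides applied."""
    if env_name not in ENV_PRESETS:
        raise KeyError(f"no preset for {env_name!r}")
    config = {**base, **ENV_PRESETS[env_name]}
    # Warmup must fit inside the buffer, or training can never start.
    assert config["train_start"] <= config["buffer_size"]
    return config


cartpole_cfg = make_config("CartPole-v1", dict(lr=2.5e-4, buffer_size=100_000))
```

Keeping the overrides separate from the base makes it obvious which settings are environment-specific and which are shared defaults.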

Evaluation protocol

Separate training (ε-greedy) from evaluation (ε = 0):

python
def evaluate(agent, env, episodes=20):
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = agent.greedy(obs)  # no exploration
            obs, r, term, trunc, _ = env.step(action)
            done = term or trunc
            total += r
        returns.append(total)
    return sum(returns) / len(returns)

Report eval return every 10k training steps — training return alone is noisy.

Checkpoint: If the loss falls but the return stays flat, the network is likely fitting stale targets that do not improve the policy. Check ε, the target update frequency, and whether the actions stored in the buffer are the ones the behavior policy actually took.

Summary: Debug with curves and ablations, and tune the learning rate and target updates before changing the architecture.

Common mistakes

  1. Tuning on eval noise — 5 eval episodes is not enough; use 20+.
  2. Changing five hyperparameters at once — impossible to attribute improvement.
  3. No seed control — run 3 seeds before declaring victory.
  4. Comparing to papers without matching frames vs episodes — Atari counts frames; CartPole counts episodes.
  5. Ignoring the Gymnasium API: terminated and truncated both end the episode, but only terminated should switch off bootstrapping; a truncated (time-limit) episode still bootstraps from the next state.
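
For the seed-control point, a small helper can summarize final evaluation returns across seeds before two settings are compared. This is a rough sketch: the function names are hypothetical, the comparison rule is a crude heuristic rather than a statistical test, and the numbers in the usage lines are placeholders, not measured results.

```python
import statistics


def summarize_seeds(returns_by_seed):
    """Mean and sample standard deviation of eval returns across seeds."""
    values = list(returns_by_seed.values())
    mean = statistics.mean(values)
    # stdev needs at least two samples; report 0.0 for a single seed.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def clearly_better(a, b):
    """Crude check: A's mean beats B's by more than the sum of their std devs."""
    mean_a, std_a = summarize_seeds(a)
    mean_b, std_b = summarize_seeds(b)
    return mean_a - mean_b > std_a + std_b


# Placeholder eval returns keyed by seed (illustrative values only).
run_a = {0: 450.0, 1: 470.0, 2: 490.0}
run_b = {0: 100.0, 1: 200.0, 2: 300.0}
print(summarize_seeds(run_a), summarize_seeds(run_b), clearly_better(run_a, run_b))
```

When the gap between two settings is smaller than the spread across seeds, the safer conclusion is that the runs are indistinguishable and more seeds are needed.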
