DQN hyperparameters & debugging

Before we begin

A DQN that never beats random on CartPole is almost always a hyperparameter, logging, or environment bug — not a fundamental failure of deep RL. This lesson gives a systematic debug checklist and sensible search ranges so you spend time learning, not guessing.

Learning objectives

Set up a minimal experiment log: return, loss, ε, Q magnitudes.
Diagnose failure modes from learning curves.
Tune learning rate, buffer size, target frequency, and exploration schedule.
Run a one-at-a-time ablation on CartPole.
Know when to stop tuning DQN and try PPO instead.

Debug dashboard — what to log every N steps

Metric	Healthy trend	Red flag
Episode return (rolling 100)	Rises, plateaus near max	Flat at random baseline
TD loss	Falls then stabilizes	Monotonic explosion
Mean Q	Grows then stabilizes	NaN or 1e6+
ε	Decays per schedule	Stuck at 1.0 forever
Buffer size	Reaches capacity	Stays near zero
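
One way to collect the metrics in the table is a small in-memory logger. The sketch below is illustrative only: the class name `MetricLogger`, its method names, and the row keys are assumptions, not part of any library. It uses a 100-episode rolling window to match the table.

```python
from collections import deque
from statistics import fmean


class MetricLogger:
    """Hypothetical helper that tracks the dashboard metrics in memory."""

    def __init__(self, window=100):
        # Rolling window of recent episode returns.
        self.returns = deque(maxlen=window)
        self.losses = []    # TD losses since the last snapshot
        self.q_values = []  # mean Q per batch since the last snapshot
        self.rows = []      # one dict per snapshot

    def add_episode(self, episode_return):
        self.returns.append(float(episode_return))

    def add_update(self, td_loss, mean_q):
        self.losses.append(float(td_loss))
        self.q_values.append(float(mean_q))

    def snapshot(self, step, epsilon, buffer_len):
        """Record one dashboard row; call this every N environment steps."""
        nan = float("nan")
        row = {
            "step": step,
            "return_rolling": fmean(self.returns) if self.returns else nan,
            "td_loss": fmean(self.losses) if self.losses else nan,
            "mean_q": fmean(self.q_values) if self.q_values else nan,
            "epsilon": epsilon,
            "buffer_size": buffer_len,
        }
        self.rows.append(row)
        # Reset per-interval accumulators; the return window keeps rolling.
        self.losses.clear()
        self.q_values.clear()
        return row
```

In a training loop you would call `add_episode` at the end of each episode, `add_update` after each gradient step, and `snapshot` every N steps, then plot or print `rows`.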

python

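import matplotlib.pyplot as plt
import numpy as np


def plot_returns(returns, window=50):
    """Plot episode returns, smoothed with a moving average over `window` episodes."""
    r = np.asarray(returns, dtype=float)
    if len(r) < window:
        # Too few episodes to smooth; plot the raw returns.
        plt.plot(r)
    else:
        # mode="valid" keeps only full windows, so the curve has window - 1 fewer points.
        kernel = np.ones(window) / window
        plt.plot(np.convolve(r, kernel, mode="valid"))
    plt.xlabel("episode")
    plt.ylabel("return")
    plt.title("DQN learning curve")
    plt.show()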

Worked example — reading a bad run

Symptoms: return ~20 (random CartPole ~20–30), loss decreasing, mean Q increasing slowly.

Hypothesis	Test	Fix
Insufficient exploration	Check ε schedule	Slower decay or higher final ε
Target updates too rare	Log target age (steps since last sync)	Decrease target update period C to 1000 steps
LR too low	Double lr ablation	Try 5e-4
Wrong reward scaling	Print raw rewards	Normalize if needed
Bug in done handling	Log episode length	Fix terminal bootstrap

Often two issues combine. Fix the done flag first, then retune ε.
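
Because a wrong terminal bootstrap is the first thing to rule out, here is a minimal sketch of the one-step Q-learning target for a single transition. The function name `td_target` and its arguments are illustrative. The point is that the bootstrap term is dropped only when the environment reports `terminated`, not when an episode is cut off by a time limit (`truncated`).

```python
def td_target(reward, next_q_values, terminated, gamma=0.99):
    """One-step Q-learning target for a single transition.

    next_q_values: target-network Q estimates for every action in the
    next state (a plain list here, to keep the sketch dependency-free).
    terminated: True only for a true terminal state. A time-limit
    truncation should pass False so the target still bootstraps.
    """
    if terminated:
        # No future return after a true terminal state.
        return float(reward)
    return float(reward) + gamma * max(next_q_values)
```

If the replay buffer stores `done = terminated or truncated` and passes that flag here, CartPole episodes that reach the time limit are treated like failures, which is one plausible cause of a run that looks healthy in the loss curve but never improves.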

Hyperparameter search order

Tune in this sequence (one change at a time):

Correctness — replay stores right tuples, train only after warmup, gamma matches env discount.
Learning rate — sweep 1e-5, 3e-4, 1e-3 on short 50-episode runs.
Target update frequency — 500 vs 2000 vs 10000 gradient steps.
Buffer size — 10k vs 100k (CartPole); 1M for Atari.
Exploration — linear ε decay over total frames.
Network size — 64 vs 128 hidden; deeper rarely helps CartPole.
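
The "one change at a time" rule can be sketched as a generator that produces a baseline run plus variants that each differ from it in exactly one key. The helper name `one_at_a_time`, the key names, and the choice of baseline values are assumptions for illustration; the candidate values come from the list above.

```python
def one_at_a_time(base, alternatives):
    """Yield (label, config) pairs that each differ from base in one key."""
    yield "baseline", dict(base)
    for key, values in alternatives.items():
        for value in values:
            if value == base[key]:
                continue  # identical to the baseline run
            variant = dict(base)
            variant[key] = value
            yield f"{key}={value}", variant


# Hypothetical baseline; candidate values taken from the search list.
base = {"lr": 3e-4, "target_update": 2000, "buffer_size": 100_000, "hidden": 64}
alternatives = {
    "lr": [1e-5, 3e-4, 1e-3],
    "target_update": [500, 2000, 10_000],
    "buffer_size": [10_000, 100_000],
    "hidden": [64, 128],
}
runs = list(one_at_a_time(base, alternatives))
```

With these values the generator yields 7 runs, compared with 36 for the full grid, and every difference in results can be attributed to a single setting.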

python

config = dict(
    lr=2.5e-4,
    gamma=0.99,
    batch_size=64,
    buffer_size=100_000,
    target_update=1000,
    eps_start=1.0,
    eps_end=0.05,
    eps_decay_frames=50_000,
    train_start=5_000,
    grad_clip=10.0,
)
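
The three `eps_*` entries in the config describe a linear exploration schedule. A minimal sketch of that schedule follows; the function name `linear_epsilon` is an assumption, and the defaults simply mirror the config values.

```python
def linear_epsilon(frame, eps_start=1.0, eps_end=0.05, eps_decay_frames=50_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold at eps_end."""
    # Fraction of the decay period completed, clipped to [0, 1].
    fraction = min(max(frame / eps_decay_frames, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```

Logging the value this function returns at each snapshot makes a stuck schedule easy to spot: with the defaults, ε should be about 0.525 halfway through the decay period and stay at about 0.05 afterwards.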

Environment-specific notes

Environment	γ	Replay buffer size	Train start (steps)	Notes
CartPole-v1	0.99	50k	1k	Solved at mean return ≥ 475 over 100 consecutive episodes
MountainCar	0.99	100k	10k	Sparse reward; prioritized experience replay (PER) helps
Atari	0.99	1M	50k	Frame stack 4, grayscale

CartPole episodes are short (CartPole-v1 caps them at 500 steps), so progress comes from collecting many episodes, not from long horizons within a single episode.

Evaluation protocol

Separate training (ε-greedy) from evaluation (ε = 0):

python

def evaluate(agent, env, episodes=20):
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = agent.greedy(obs)  # no exploration
            obs, r, term, trunc, _ = env.step(action)
            done = term or trunc
            total += r
        returns.append(total)
    return sum(returns) / len(returns)

Report eval return every 10k training steps — training return alone is noisy.

Checkpoint: If loss goes down but return stays flat, the network is fitting stale targets that do not improve the policy. Check ε, the target update frequency, and whether the actions stored in the buffer are the ones the behavior policy actually executed.

Summary: Debug with curves and ablations; tune the learning rate and target updates before changing the architecture.
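
The "loss falls but return is flat" pattern, along with two red flags from the dashboard table, can be checked automatically. The sketch below assumes metrics are logged as a list of dicts, oldest first. The function name `red_flags`, the key names, and the thresholds are assumptions: the default baseline of 30 is the upper end of the random CartPole range quoted earlier, and comparing only the first and last rows is a deliberately crude heuristic.

```python
import math


def red_flags(history, random_baseline=30.0, q_limit=1e6):
    """Return warnings for a list of logged rows (oldest first).

    Each row is a dict with keys "return_rolling", "td_loss",
    "mean_q" and "epsilon" (hypothetical names).
    """
    flags = []
    first, last = history[0], history[-1]
    if math.isnan(last["mean_q"]) or abs(last["mean_q"]) >= q_limit:
        flags.append("mean Q is NaN or exploding")
    if len(history) > 1 and last["epsilon"] >= 1.0:
        flags.append("epsilon has not decayed")
    loss_fell = last["td_loss"] < first["td_loss"]
    return_flat = last["return_rolling"] <= random_baseline
    if len(history) > 1 and loss_fell and return_flat:
        flags.append("loss falls but return is at the random baseline")
    return flags
```

An empty list does not prove the run is healthy; it only means none of these specific symptoms were detected.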

Common mistakes

Tuning on eval noise — 5 eval episodes is not enough; use 20+.
Changing five hyperparameters at once — impossible to attribute improvement.
No seed control — run 3 seeds before declaring victory.
Comparing to papers without matching frames vs episodes — Atari counts frames; CartPole counts episodes.
Ignoring the Gymnasium API — terminated and truncated both end the episode, but only terminated should zero the bootstrap term; after a truncation the target should still bootstrap from the next state.
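
For the seed-control point, a small summary across seeds keeps one lucky run from being mistaken for a result. This is a sketch under stated assumptions: the function name `summarize_seeds` is hypothetical, and its input is a dict mapping each seed to that run's final evaluation return.

```python
from statistics import mean, stdev


def summarize_seeds(eval_returns_by_seed):
    """Summarize final evaluation returns keyed by seed."""
    values = list(eval_returns_by_seed.values())
    # Sample standard deviation needs at least two runs.
    spread = stdev(values) if len(values) > 1 else 0.0
    return {
        "n_seeds": len(values),
        "mean": mean(values),
        "std": spread,
        "min": min(values),
        "max": max(values),
    }
```

Reporting the minimum alongside the mean is a deliberate choice: a configuration that solves the task on two seeds and fails on the third shows up in `min` even when the mean looks acceptable.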
