DQN hyperparameters & debugging
Before we begin
A DQN that never beats random on CartPole is almost always a hyperparameter, logging, or environment bug — not a fundamental failure of deep RL. This lesson gives a systematic debug checklist and sensible search ranges so you spend time learning, not guessing.
Learning objectives
- Set up a minimal experiment log: return, loss, ε, Q magnitudes.
- Diagnose failure modes from learning curves.
- Tune learning rate, buffer size, target frequency, and exploration schedule.
- Run a one-at-a-time ablation on CartPole.
- Know when to stop tuning DQN and try PPO instead.
Debug dashboard — what to log every N steps
| Metric | Healthy trend | Red flag |
|---|---|---|
| Episode return (rolling 100) | Rises, plateaus near max | Flat at random baseline |
| TD loss | Falls then stabilizes | Monotonic explosion |
| Mean Q | Grows then stabilizes | NaN or 1e6+ |
| ε | Decays per schedule | Stuck at 1.0 forever |
| Buffer size | Reaches capacity | Stays near zero |
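One way to feed the dashboard above is a small in-memory logger that records a row every N steps. The sketch below is illustrative only: `MetricLogger`, `end_episode`, and `maybe_log` are hypothetical names, not part of any library, and the training loop is assumed to supply the TD loss, mean Q, ε, and buffer length itself.

```python
from collections import deque


class MetricLogger:
    """Collects the dashboard metrics and stores one summary row every `log_every` steps."""

    def __init__(self, log_every=1_000, return_window=100):
        self.log_every = log_every
        self.returns = deque(maxlen=return_window)  # rolling window of episode returns
        self.rows = []  # one dict per logged step

    def end_episode(self, episode_return):
        """Record the return of a finished episode."""
        self.returns.append(float(episode_return))

    def maybe_log(self, step, td_loss, mean_q, epsilon, buffer_len):
        """Append a row if `step` is a multiple of `log_every`; return the row or None."""
        if step % self.log_every != 0:
            return None
        if self.returns:
            rolling = sum(self.returns) / len(self.returns)
        else:
            rolling = float("nan")  # no episode has finished yet
        row = dict(
            step=step,
            rolling_return=rolling,
            td_loss=float(td_loss),
            mean_q=float(mean_q),
            epsilon=float(epsilon),
            buffer_len=int(buffer_len),
        )
        self.rows.append(row)
        return row
```

Calling `maybe_log` on every environment step keeps the logging decision in one place; the stored rows can then be plotted or dumped to CSV.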

    import matplotlib.pyplot as plt
    import numpy as np

    def plot_returns(returns, window=50):
        """Plot episode returns, smoothed with a moving average over `window` episodes."""
        r = np.array(returns, dtype=float)
        if len(r) < window:
            plt.plot(r)  # too few episodes to smooth; plot the raw returns
        else:
            kernel = np.ones(window) / window
            plt.plot(np.convolve(r, kernel, mode="valid"))
        plt.xlabel("episode")
        plt.ylabel("return")
        plt.title("DQN learning curve")
        plt.show()

Worked example — reading a bad run
Symptoms: return ~20 (random CartPole ~20–30), loss decreasing, mean Q increasing slowly.
| Hypothesis | Test | Fix |
|---|---|---|
| Insufficient exploration | Check ε schedule | Slower decay or higher final ε |
| Target updates too rare | Log target age (steps since last sync) | Decrease the target update interval C to 1000 steps |
| LR too low | Run an ablation with lr doubled | Try 5e-4 |
| Wrong reward scaling | Print raw rewards | Normalize if needed |
| Bug in done handling | Log episode length | Fix terminal bootstrap |
Often two issues combine — fix done flag first, then retune ε.
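To test the exploration hypothesis, it can help to print ε at a few step counts and compare against what the agent actually logs. The following is a minimal sketch of a linear schedule; `linear_epsilon` is a hypothetical helper, and its defaults are simply illustrative values, not a recommendation.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=50_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps, then hold."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return eps_start + fraction * (eps_end - eps_start)


# Print epsilon at the start, midpoint, end, and well past the end of the decay.
for step in (0, 25_000, 50_000, 100_000):
    print(step, round(linear_epsilon(step), 3))
```

If the logged ε never leaves 1.0, a common cause is that the schedule is driven by a counter (for example episodes) that grows far more slowly than the one it was designed for (for example environment steps).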
Hyperparameter search order
Tune in this sequence (one change at a time):
- Correctness — replay stores right tuples, train only after warmup, gamma matches env discount.
- Learning rate — sweep 1e-5, 3e-4, 1e-3 on short 50-episode runs.
- Target update frequency — 500 vs 2000 vs 10000 gradient steps.
- Buffer size — 10k vs 100k (CartPole); 1M for Atari.
- Exploration — linear ε decay over total frames.
- Network size — 64 vs 128 hidden; deeper rarely helps CartPole.
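The "one change at a time" rule above can be made mechanical by generating run configs from a baseline. This is a sketch under stated assumptions: `one_at_a_time` is a hypothetical helper, and the baseline values are placeholders chosen from the ranges listed above.

```python
def one_at_a_time(base, sweeps):
    """Yield (label, config) pairs that each differ from `base` in exactly one key."""
    yield "baseline", dict(base)
    for key, values in sweeps.items():
        for value in values:
            if value == base[key]:
                continue  # already covered by the baseline run
            variant = dict(base)
            variant[key] = value
            yield f"{key}={value}", variant


# Illustrative baseline and sweep values (assumed, not tuned).
base = dict(lr=3e-4, target_update=2000, buffer_size=100_000)
sweeps = dict(lr=[1e-5, 3e-4, 1e-3], target_update=[500, 2000, 10_000])

runs = list(one_at_a_time(base, sweeps))
for label, cfg in runs:
    print(label, cfg)
```

Because every variant shares all other settings with the baseline, any difference in the learning curve can be attributed to the single key named in its label, up to seed noise.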

    config = dict(
        lr=2.5e-4,
        gamma=0.99,
        batch_size=64,
        buffer_size=100_000,
        target_update=1000,
        eps_start=1.0,
        eps_end=0.05,
        eps_decay_frames=50_000,
        train_start=5_000,
        grad_clip=10.0,
    )

Environment-specific notes
| Environment | γ | Replay | Train start | Notes |
|---|---|---|---|---|
| CartPole-v1 | 0.99 | 50k | 1k | Solve ~475 over 100 eps |
| MountainCar | 0.99 | 100k | 10k | Sparse reward; prioritized experience replay (PER) helps |
| Atari | 0.99 | 1M | 50k | Frame stack 4, grayscale |
CartPole episodes are short (CartPole-v1 caps them at 500 steps), so progress comes from running many episodes rather than from a long horizon within each episode.
Evaluation protocol
Separate training (ε-greedy) from evaluation (ε = 0):

    def evaluate(agent, env, episodes=20):
        """Return the mean return of the greedy policy over `episodes` full episodes."""
        returns = []
        for _ in range(episodes):
            obs, _ = env.reset()
            done, total = False, 0.0
            while not done:
                action = agent.greedy(obs)  # no exploration
                obs, r, term, trunc, _ = env.step(action)
                done = term or trunc  # either flag ends the episode
                total += r
            returns.append(total)
        return sum(returns) / len(returns)

Report the eval return every 10k training steps; the training return alone is noisy.
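Since a single evaluation is still noisy, one option is to aggregate eval returns across seeds before comparing configurations. The sketch below assumes eval returns have already been collected per seed; `summarize_seeds` is a hypothetical helper and the numbers in the example are made up to show the call shape.

```python
import statistics


def summarize_seeds(eval_returns_by_seed):
    """Return (mean, sample std) of the per-seed mean eval returns."""
    per_seed = [sum(r) / len(r) for r in eval_returns_by_seed.values()]
    mean = statistics.mean(per_seed)
    # Sample standard deviation needs at least two seeds.
    std = statistics.stdev(per_seed) if len(per_seed) > 1 else 0.0
    return mean, std


# Made-up numbers purely for illustration; not results from any run.
example = {0: [100.0, 200.0], 1: [200.0, 200.0], 2: [250.0, 250.0]}
mean, std = summarize_seeds(example)
print(f"eval return: {mean:.1f} +/- {std:.1f} over {len(example)} seeds")
```

Averaging within a seed first, then across seeds, treats each seed as one independent sample, which is usually the quantity of interest when deciding whether a hyperparameter change helped.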
Checkpoint: If the loss goes down but the return stays flat, the network is fitting stale targets that do not improve the policy. Check ε, the target update frequency, and whether the actions stored in the buffer match the behavior policy.
Summary: Debug with curves and ablations; tune the learning rate and target updates before changing the architecture.
Common mistakes
- Tuning on eval noise — 5 eval episodes is not enough; use 20+.
- Changing five hyperparameters at once — impossible to attribute improvement.
- No seed control — run 3 seeds before declaring victory.
- Comparing to papers without matching frames vs episodes — Atari counts frames; CartPole counts episodes.
- Ignoring Gymnasium API — terminated and truncated both end the episode but mean different things for the bootstrap: only terminated should zero the next-state value in the target.
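The terminated/truncated distinction in the last item shows up directly in the TD target. Here is a minimal NumPy sketch of the one-step target, assuming the buffer stores the `terminated` flag separately from `truncated`; `td_targets` is a hypothetical helper and `next_q_max` stands in for the target network's max Q-value at the next state.

```python
import numpy as np


def td_targets(rewards, next_q_max, terminated, gamma=0.99):
    """One-step targets r + gamma * max_a' Q_target(s', a'), with no bootstrap on termination."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_max = np.asarray(next_q_max, dtype=np.float64)
    not_terminal = 1.0 - np.asarray(terminated, dtype=np.float64)
    return rewards + gamma * not_terminal * next_q_max


# Transition 0: ordinary step.
# Transition 1: terminated (the pole fell), so the target is the reward alone.
# Transition 2: truncated by the time limit, so terminated is False and we still bootstrap.
targets = td_targets(
    rewards=[1.0, 1.0, 1.0],
    next_q_max=[10.0, 10.0, 10.0],
    terminated=[False, True, False],
    gamma=0.5,
)
print(targets)
```

Masking with `term or trunc` instead would treat time-limit cutoffs as true failures and bias Q-values downward near the step cap.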