Instability & the deadly triad
Before we begin
Three ingredients, when combined, can make value-based RL diverge: (1) function approximation, (2) bootstrapping (TD targets), and (3) off-policy training. Sutton and Barto call this combination the deadly triad. Tabular on-policy MC avoids all three; DQN uses all three by design.
Knowing the triad explains why experience replay, target networks, clipped objectives, and on-policy algorithms exist.
What you will learn
- Name the three elements of the deadly triad.
- Explain positive feedback in bootstrapped FA (chasing moving targets).
- Relate instability to DQN mitigations (replay, target net, clip).
- Compare stabilizers: PPO (on-policy), SAC (off-policy with tricks).
- Diagnose divergence from learning curves and Q-value magnitudes.
The deadly triad
| Ingredient | Role in instability |
|---|---|
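| Function approximation | Errors generalize: a wrong Q(s, a) estimate leaks into similar states |
| Bootstrapping | Target is built from the current estimate V̂(s′) or max Q̂, so errors feed back into targets |
| Off-policy | Data distribution ≠ target policy's distribution; Q is extrapolated to rarely sampled actions |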
Remove any one:
- Tabular → no harmful generalization
- MC targets → no bootstrap bias loop
- On-policy → less distribution mismatch (still can struggle with FA + bootstrap)
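To see where each ingredient enters, here is a minimal sketch of one semi-gradient Q-learning step with a linear approximator. The names `q_hat` and `semi_gradient_q_update`, the feature vectors, and the default hyperparameters are illustrative assumptions, not part of any library.

```python
def q_hat(theta, features):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(t * f for t, f in zip(theta, features))


def semi_gradient_q_update(theta, phi_sa, reward, phi_next_all, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning step on a single transition.

    phi_sa       -- feature vector of the visited (s, a) pair
    phi_next_all -- feature vectors of every action available in s'
    """
    # (2) Bootstrapping: the target is built from the current estimates at s'.
    # (3) Off-policy: the max evaluates the greedy policy, whatever the
    #     behavior policy that collected the transition actually does next.
    target = reward + gamma * max(q_hat(theta, phi) for phi in phi_next_all)
    td_error = target - q_hat(theta, phi_sa)
    # (1) Function approximation: the gradient of a linear Q is phi_sa, and
    #     theta is shared, so this step also moves Q at other (s, a) pairs.
    return [t + alpha * td_error * f for t, f in zip(theta, phi_sa)]


# Example call with hypothetical features: the next state's best action has
# twice the feature magnitude of the visited pair.
new_theta = semi_gradient_q_update(
    theta=[1.0, 0.0],
    phi_sa=[1.0, 0.0],
    reward=0.0,
    phi_next_all=[[2.0, 0.0], [0.0, 1.0]],
    alpha=0.1,
    gamma=0.9,
)
```

In this call the first weight grows even though the reward is zero, because the target is computed from the same weight the step is updating.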
Positive feedback loop
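- Q(s′, a) is overestimated for some state s′.
- The max_a Q(s′, a) term carries that overestimate into the targets of predecessor states.
- Because θ is shared, the gradient step spreads the overestimate to similar states.
- Q values explode, the argmax flips wildly, and the policy becomes chaotic.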
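The first step of the loop does not need a bug to get started: taking a max over noisy estimates is biased upward on its own. The following simulation is a small sketch of that effect, assuming every action has a true value of 0 and each estimate carries independent zero-mean Gaussian noise. The function name and its defaults are hypothetical.

```python
import random


def mean_max_of_noisy_estimates(num_actions=10, noise_std=1.0, trials=2000, seed=0):
    """Average of max_a Q_hat(a) when the true Q(a) is 0 for every action."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value (0) plus zero-mean noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


bias_many_actions = mean_max_of_noisy_estimates(num_actions=10)
bias_single_action = mean_max_of_noisy_estimates(num_actions=1)
```

With a single action the average stays near zero, while with ten actions it is clearly positive, even though no individual estimate is biased. Bootstrapping then copies that positive bias into the targets of predecessor states.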
# Divergence symptom: monitor during training
if step % 1000 == 0:
    max_q = q_values.max().item()
    if max_q > 1e4:
        print("Warning: Q magnitude exploding", max_q)
Mitigations map
| Technique | Attacks |
|---|---|
| Experience replay | Correlation, non-i.i.d. samples |
| Target network | Moving target (slow θ⁻) |
| Double DQN | max overestimation bias |
| Gradient clipping | Parameter blow-up |
| Huber loss | Outlier TD errors |
| On-policy (PPO) | Off-policy leg of triad |
| Lower learning rate | All — slows divergence |
Module 4 covers the DQN stack in depth.
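Three rows of the table can be sketched without a deep learning framework. The helpers below are a minimal illustration, assuming Q-values arrive as plain Python lists. `double_dqn_target`, `huber_loss`, and `clip_by_global_norm` are hypothetical names chosen for this example, and the default `gamma` and `delta` values are placeholders.

```python
import math


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def huber_loss(td_error, delta=1.0):
    """Quadratic near zero, linear for large errors, so outliers pull less."""
    abs_err = abs(td_error)
    if abs_err <= delta:
        return 0.5 * td_error ** 2
    return delta * (abs_err - 0.5 * delta)


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat gradient list so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# The online net prefers action 1, so the target net's value for action 1 is
# used, not the target net's own maximum (action 2).
target = double_dqn_target(
    reward=1.0,
    done=False,
    q_online_next=[1.0, 3.0, 2.0],
    q_target_next=[0.5, 1.5, 4.0],
    gamma=0.9,
)
```

Decoupling action selection from action evaluation is what reduces the max bias: an action has to look good to both networks before it inflates the target.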
Worked intuition: linear divergent example
Baird's counterexample, along with related results from Boyan and others, shows that off-policy linear TD can diverge even in simple MDPs with innocent-looking features. Not every (φ, γ, α) triple is safe.
| Setup | Outcome |
|---|---|
| On-policy linear TD(0) | Converges (linearly independent features, decaying step size) |
| Off-policy linear TD | Can diverge |
| Tabular Q-learning | Converges (every state-action pair visited infinitely often, decaying step size) |
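A tiny construction makes the first two rows concrete. The sketch below assumes a two-state chain with a single shared weight w, features φ(s1) = 1 and φ(s2) = 2, and zero reward everywhere, so the correct answer is w = 0. It is a simplified setup written for this lesson in the spirit of the published counterexamples, not a reproduction of any one of them. The function name and the constant step size are assumptions.

```python
def run_linear_td(num_sweeps, update_second_state, alpha=0.1, gamma=0.9, w=1.0):
    """Semi-gradient TD(0) with one weight: V(s1) = w, V(s2) = 2w.

    Transitions: s1 -> s2 (reward 0), s2 -> terminal (reward 0).
    update_second_state=False mimics an off-policy data distribution that
    only ever trains on the s1 -> s2 transition.
    """
    for _ in range(num_sweeps):
        # s1 -> s2: bootstrapped target is gamma * V(s2) = gamma * 2w.
        td_error = 0.0 + gamma * (2.0 * w) - w
        w += alpha * td_error * 1.0  # gradient of V(s1) w.r.t. w is 1
        if update_second_state:
            # s2 -> terminal: target is 0, gradient of V(s2) w.r.t. w is 2.
            td_error = 0.0 - 2.0 * w
            w += alpha * td_error * 2.0
    return w


w_off_policy = run_linear_td(50, update_second_state=False)
w_on_policy = run_linear_td(50, update_second_state=True)
```

When only the first transition is trained, each update multiplies w by 1 + α(2γ − 1), which exceeds 1 whenever γ > 0.5, so w grows without bound. Including the second transition, as following the chain on-policy would, adds a correction that pulls w back toward 0.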
Diagnostics
| Signal | Likely issue |
|---|---|
| Q mean drifts upward forever | Overestimation, no target net |
| Loss down, eval return flat | Overfitting replay, stale data |
| Sudden return collapse | Policy chasing spurious Q peak |
| NaN weights | α too high, no grad clip |
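The first and last rows of the table can be checked automatically from logged statistics. This is a rough sketch, assuming you already record the mean Q-value at regular intervals. `diagnose_q_means`, the window size, and the drift ratio are hypothetical and would need tuning for a real task.

```python
import math


def diagnose_q_means(q_means, window=3, drift_ratio=2.0):
    """Return warning strings for a logged series of mean Q-values."""
    warnings = []
    # Non-finite values usually mean the update has already blown up.
    if any(math.isnan(q) or math.isinf(q) for q in q_means):
        warnings.append("non-finite Q mean")
        return warnings
    # Compare the earliest and latest windows once there is enough history.
    if len(q_means) >= 2 * window:
        early = sum(q_means[:window]) / window
        late = sum(q_means[-window:]) / window
        if late > drift_ratio * max(abs(early), 1e-8):
            warnings.append("Q mean drifting upward")
    return warnings


drifting = diagnose_q_means([1.0, 1.1, 0.9, 4.0, 5.0, 6.0])
stable = diagnose_q_means([1.0, 1.0, 1.0, 1.1, 1.0, 0.9])
```

A rising Q mean is not always a fault, since returns should grow as the policy improves. The warning is a prompt to compare Q estimates against observed returns, not a verdict.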
Checkpoint: You remove replay but keep off-policy Q-learning with a neural network. What gets worse, and does the triad itself change?
Answer
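The triad itself does not change: off-policy learning, function approximation, and bootstrapping are all still present. What worsens is sample correlation. Consecutive transitions are highly correlated, so successive updates push θ in the same direction and amplify bootstrap errors, which makes instability more likely. Replay breaks that temporal correlation.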
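For reference, a uniform replay buffer is only a few lines. The class below is a minimal sketch using the standard library. `ReplayBuffer` and its method names are assumptions for illustration, and transitions can be any Python object, such as an (s, a, r, s′, done) tuple.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque with maxlen drops the oldest transition once full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition):
        self._buffer.append(transition)

    def sample(self, batch_size):
        # Sampling without replacement across the whole buffer mixes old and
        # new transitions, which breaks up temporal correlation.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for transition_id in range(5):
    buffer.add(transition_id)  # integers stand in for real transitions
batch = buffer.sample(2)
```

Copying the deque to a list on every sample is fine for a sketch but slow for large buffers; a preallocated array with a write index is the usual choice at scale.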
Design philosophy
| Priority | Choose |
|---|---|
| Stability first | PPO, careful on-policy |
| Sample efficiency | SAC, DQN + full stabilizer kit |
| Simplicity / debug | Tabular or linear FA |
Production teams log Q percentiles, TD error histograms, and policy entropy — not just return.
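Two of those statistics are easy to compute with the standard library. The helpers below are a sketch, assuming Q-values and action probabilities are available as plain lists. `q_percentiles` and `policy_entropy` are hypothetical names, and the choice of the 5th, 50th, and 95th percentiles is arbitrary.

```python
import math
import statistics


def q_percentiles(q_values):
    """5th, 50th, and 95th percentiles of a batch of Q-values."""
    # quantiles(n=100) returns 99 cut points; index i - 1 is the i-th percentile.
    cuts = statistics.quantiles(q_values, n=100, method="inclusive")
    return {"p5": cuts[4], "p50": cuts[49], "p95": cuts[94]}


def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


summary = q_percentiles([float(q) for q in range(101)])
uniform_entropy = policy_entropy([0.25, 0.25, 0.25, 0.25])
greedy_entropy = policy_entropy([1.0, 0.0, 0.0, 0.0])
```

Percentiles show whether a few outlier Q-values or the whole distribution is moving, and an entropy that collapses toward zero early in training often means the policy has locked onto a spurious peak.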
Common mistakes
- Assuming loss ↓ implies good policy.
- Tuning only learning rate when replay size and target sync are wrong.
- Ignoring non-stationary data in off-policy FA.
- Copying DQN hyperparameters from Atari to CartPole without scaling.
You now know why deep RL needs engineering beyond the Bellman update. Module 4 builds DQN with experience replay and target networks — turning unstable naive Q-learning into a trainable system on CartPole and Atari.