Instability & the deadly triad

Before we begin

Three ingredients, when combined, can make value-based RL diverge: (1) function approximation, (2) bootstrapping (TD targets), and (3) off-policy training. Sutton and Barto call this combination the deadly triad. Tabular on-policy Monte Carlo (MC) is safe; DQN sits in the danger zone by design.

Knowing the triad explains why experience replay, target networks, clipped objectives, and on-policy algorithms exist.

What you will learn

Name the three elements of the deadly triad.
Explain the positive feedback loop in bootstrapped function approximation (FA): the network chases targets that move with its own weights.
Relate instability to DQN mitigations (replay, target net, clip).
Compare stabilizers: PPO (on-policy), SAC (off-policy with tricks).
Diagnose divergence from learning curves and Q-value magnitudes.

The deadly triad

Ingredient	Role in instability
Function approximation	Errors generalize — wrong Q(s) poisons similar states
Bootstrapping	Target uses current estimate V̂(s′) or max Q̂
Off-policy	Data distribution ≠ target policy's distribution; Q is extrapolated to state-action pairs the data rarely covers

Remove any one:

Tabular → no harmful generalization
MC targets → no bootstrap bias loop
On-policy → less distribution mismatch (still can struggle with FA + bootstrap)

Positive feedback loop

Overestimate Q(s, a) for some s.
max_a Q(s′, a) inflates target at predecessor states.
Gradient spreads overestimation via shared θ.
Q values explode → argmax flips wildly → policy chaotic.

python

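# Divergence symptom: monitor during training.
# Assumes the training loop defines `step` (int) and `q_values`
# (a torch.Tensor of Q estimates for the current batch).
if step % 1000 == 0:
    max_q = q_values.abs().max().item()  # magnitude, so negative blow-ups count too
    if max_q > 1e4:  # threshold is task-dependent; scale it to your reward range
        print(f"Warning: Q magnitude exploding: {max_q:.3g}")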
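
The first two steps of the loop can be seen without any learning at all: a max over noisy estimates is biased upward. The following is a minimal sketch, assuming every true action value is zero and the estimation noise is Gaussian. Both are illustrative assumptions, and the names `n_actions`, `n_trials`, and `noise_std` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 10      # hypothetical number of actions
n_trials = 10_000   # independent noisy estimates of the same state
noise_std = 1.0     # assumed standard deviation of the estimation error

true_q = np.zeros(n_actions)  # every action is truly worth 0
noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, n_actions))

# A bootstrapped target uses max_a Q(s', a); average that max over trials.
mean_of_max = noisy_q.max(axis=1).mean()
# For comparison: average the noise away first, then take the max.
max_of_mean = noisy_q.mean(axis=0).max()

print(f"true max:    {true_q.max():.3f}")
print(f"mean of max: {mean_of_max:.3f}")
print(f"max of mean: {max_of_mean:.3f}")
```

The mean of the max sits well above the true maximum of zero, even though each individual estimate is unbiased. With shared parameters, that inflated target is then propagated to predecessor states.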

Mitigations map

Technique	Attacks
Experience replay	Correlation, non-i.i.d. samples
Target network	Moving target (slow θ⁻)
Double DQN	max overestimation bias
Gradient clipping	Parameter blow-up
Huber loss	Outlier TD errors
On-policy (PPO)	Off-policy leg of triad
Lower learning rate	All — slows divergence

Module 4 covers the DQN stack in depth.
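
To make three rows of the table concrete, here is a hedged NumPy sketch of a standard DQN target, a Double DQN target, and the Huber loss. The function names and the toy batch are made up for illustration; a real implementation would compute the Q arrays with an online network and a target network.

```python
import numpy as np

def dqn_target(reward, done, q_next_target, gamma=0.99):
    """Standard target: max over the target network's own estimates."""
    return reward + gamma * (1.0 - done) * q_next_target.max(axis=1)

def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online net picks the action, the target net scores it."""
    best_action = q_next_online.argmax(axis=1)
    evaluated = q_next_target[np.arange(len(best_action)), best_action]
    return reward + gamma * (1.0 - done) * evaluated

def huber(td_error, delta=1.0):
    """Quadratic near zero, linear beyond delta, so outliers pull less."""
    abs_err = np.abs(td_error)
    return np.where(abs_err <= delta,
                    0.5 * td_error ** 2,
                    delta * (abs_err - 0.5 * delta))

# Toy batch: two transitions, three actions (made-up numbers).
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])  # second transition is terminal
q_next_online = np.array([[1.0, 2.0, 0.5], [0.3, 0.1, 0.2]])
q_next_target = np.array([[3.0, 1.5, 0.5], [0.4, 0.9, 0.2]])

y_dqn = dqn_target(reward, done, q_next_target)
y_ddqn = double_dqn_target(reward, done, q_next_online, q_next_target)
print("DQN target:       ", y_dqn)
print("Double DQN target:", y_ddqn)
print("Huber losses:     ", huber(np.array([0.5, 3.0])))
```

Because the Double DQN target evaluates one specific action with the target network rather than taking that network's max, it can never exceed the standard target computed from the same arrays.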

Worked intuition: linear divergent example

Baird's counterexample, among others, shows that off-policy linear TD can diverge even in a simple MDP with innocent-looking features. Not every combination of features φ, discount γ, and step size α is safe.

Setup	Outcome
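On-policy linear TD(0)	Converges (given linearly independent features and suitably decaying step sizes)
Off-policy linear TD	Can diverge
Tabular Q-learning	Converges (if every state-action pair is visited infinitely often and step sizes decay suitably)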
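
The contrast in the table can be reproduced with a single weight. The sketch below follows a well-known two-state construction discussed in Sutton and Barto's textbook: state 1 has feature 1, state 2 has feature 2, all rewards are zero, so the correct weight is w = 0. The terminal transition out of state 2 and the values of `alpha` and `gamma` are assumptions chosen for illustration.

```python
def off_policy_td(w=1.0, alpha=0.1, gamma=0.9, steps=100):
    """Update only on the transition s1 -> s2, ignoring what follows s2."""
    for _ in range(steps):
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        w += alpha * td_error * 1.0  # gradient of v(s1) = w * 1
    return w

def on_policy_td(w=1.0, alpha=0.1, gamma=0.9, steps=100):
    """Also update on s2 -> terminal, as following the trajectory would."""
    for _ in range(steps):
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        w += alpha * td_error * 1.0
        td_error = 0.0 + 0.0 - (2.0 * w)  # terminal state has value 0
        w += alpha * td_error * 2.0       # gradient of v(s2) = w * 2
    return w

print("off-policy w:", off_policy_td())
print("on-policy  w:", on_policy_td())
```

In the off-policy version each update multiplies w by 1 + α(2γ − 1), which exceeds 1 whenever γ > 0.5, so w grows without bound. In the on-policy version the update at state 2 pulls w back toward zero and the combined factor per pass is below 1 for these settings.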

Diagnostics

Signal	Likely issue
Q mean drifts upward forever	Overestimation, no target net
Loss down, eval return flat	Overfitting replay, stale data
Sudden return collapse	Policy chasing spurious Q peak
NaN weights	α too high, no grad clip
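
Two of these signals are easy to check automatically. The following is a rough sketch, assuming you already log the mean Q value at regular intervals; the helper names `q_drift_slope` and `has_nonfinite`, the window size, and the synthetic logs are all hypothetical.

```python
import numpy as np

def q_drift_slope(q_means, window=50):
    """Least-squares slope of the last `window` logged Q means."""
    recent = np.asarray(q_means[-window:], dtype=float)
    if len(recent) < 2:
        return 0.0
    slope, _intercept = np.polyfit(np.arange(len(recent)), recent, deg=1)
    return float(slope)

def has_nonfinite(arrays):
    """True if any array (weights, losses, Q values) contains NaN or inf."""
    return any(not np.all(np.isfinite(a)) for a in arrays)

# Made-up logs: one run plateaus, the other keeps climbing.
steps = np.arange(200)
stable = 10.0 * (1.0 - np.exp(-steps / 20.0))
drifting = 0.5 * steps

print("stable slope:  ", q_drift_slope(stable))
print("drifting slope:", q_drift_slope(drifting))
print("NaN detected:  ", has_nonfinite([np.array([1.0, np.nan])]))
```

A slope that stays clearly positive late in training is a prompt to look at the target network and overestimation, not proof of divergence on its own; what counts as "clearly positive" depends on the reward scale.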

Checkpoint: You remove replay but keep off-policy Q-learning + neural net. What triad property worsens?

Answer

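Strictly speaking, none of the three triad ingredients changes: off-policy learning, function approximation, and bootstrapping are all still present. What worsens is sample correlation. Replay breaks temporal correlation between updates; without it, correlated consecutive updates amplify bootstrap errors and make instability more likely.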
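
The correlation effect can be measured directly. Here is a small sketch, assuming a one-dimensional random walk stands in for a state trajectory (an illustrative assumption, not part of the lesson); `lag1_corr` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D state trajectory: a random walk, so neighbours are similar.
states = np.cumsum(rng.normal(size=10_000))

def lag1_corr(x):
    """Correlation between each sample and the one that follows it."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

online_stream = states                                # updates in visit order
replay_stream = rng.choice(states, size=len(states))  # uniform replay draws

print(f"consecutive: {lag1_corr(online_stream):.3f}")
print(f"replayed:    {lag1_corr(replay_stream):.3f}")
```

Consecutive samples are almost perfectly correlated, while uniform draws from the buffer are close to uncorrelated, which is the property minibatch gradient descent implicitly relies on.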

Design philosophy

Priority	Choose
Stability first	PPO, careful on-policy
Sample efficiency	SAC, DQN + full stabilizer kit
Simplicity / debug	Tabular or linear FA

Production teams log Q percentiles, TD error histograms, and policy entropy — not just return.
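
As one possible shape for such logging, the sketch below computes Q percentiles, a TD-error histogram, and mean policy entropy from a batch. The function name `training_stats`, the dictionary keys, and the synthetic inputs are assumptions for illustration, not a specific library's API.

```python
import numpy as np

def training_stats(q_values, td_errors, action_probs, bins=10):
    """Summaries worth logging alongside episode return."""
    q = np.asarray(q_values, dtype=float).ravel()
    td = np.asarray(td_errors, dtype=float).ravel()
    p = np.asarray(action_probs, dtype=float)  # shape (batch, n_actions)

    q_p5, q_p50, q_p95 = np.percentile(q, [5, 50, 95])
    td_counts, td_edges = np.histogram(td, bins=bins)
    # Mean Shannon entropy over the batch (natural log); epsilon avoids log(0).
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1).mean()

    return {
        "q_p5": float(q_p5),
        "q_p50": float(q_p50),
        "q_p95": float(q_p95),
        "td_hist_counts": td_counts.tolist(),
        "td_hist_edges": td_edges.tolist(),
        "policy_entropy": float(entropy),
    }

# Synthetic batch: 32 states, 4 actions, uniform policy.
rng = np.random.default_rng(0)
stats = training_stats(
    q_values=rng.normal(5.0, 1.0, size=(32, 4)),
    td_errors=rng.normal(0.0, 0.5, size=32),
    action_probs=np.full((32, 4), 0.25),
)
print(stats["q_p50"], stats["policy_entropy"])
```

Percentiles are preferred over the mean here because a handful of exploding Q values shows up in the 95th percentile long before it moves the average.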

Common mistakes

Assuming loss ↓ implies good policy.
Tuning only learning rate when replay size and target sync are wrong.
Ignoring non-stationary data in off-policy FA.
Copying DQN hyperparameters from Atari to CartPole without scaling.

You now know why deep RL needs engineering beyond the Bellman update. Module 4 builds DQN with experience replay and target networks — turning unstable naive Q-learning into a trainable system on CartPole and Atari.
