
Module 3 — Function approximation

Instability & the deadly triad

Function approximation + bootstrapping + off-policy — why DQN needs tricks.

~60 min read + exercises

Before we begin

Three ingredients together can make value-based RL diverge: (1) function approximation, (2) bootstrapping (TD targets), (3) off-policy training. Sutton calls this the deadly triad. Tabular on-policy MC is safe; DQN sits in the danger zone by design.

Knowing the triad explains why experience replay, target networks, clipped objectives, and on-policy algorithms exist.


What you will learn

  • Name the three elements of the deadly triad.
  • Explain the positive feedback loop in bootstrapped function approximation (FA): the network chases targets that it moves itself.
  • Relate instability to DQN mitigations (replay, target network, gradient clipping).
  • Compare stabilizers: PPO (on-policy), SAC (off-policy with tricks).
  • Diagnose divergence from learning curves and Q-value magnitudes.

The deadly triad

| Ingredient | Role in instability |
|---|---|
| Function approximation | Errors generalize: a wrong Q(s, a) poisons similar states |
| Bootstrapping | Target uses current estimate V̂(s′) or max Q̂ |
| Off-policy | Data distribution ≠ target policy; extrapolation |

Remove any one ingredient and the risk drops:

  • Tabular → no harmful generalization
  • MC targets → no bootstrap bias loop
  • On-policy → less distribution mismatch (still can struggle with FA + bootstrap)
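
The difference between a Monte Carlo target and a bootstrapped target can be sketched in a few lines. This is a minimal illustration, not code from the course: the function names `mc_return` and `td_target`, the three-step episode, and the value estimates are all hypothetical. It shows that the MC target depends only on observed rewards, while the TD target inherits whatever error sits in the current estimate of the next state.

```python
def mc_return(rewards, gamma):
    """Discounted return of a complete episode: built from observed rewards only."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, gamma, next_value_estimate, done):
    """One-step bootstrapped target: relies on the current estimate of V(s')."""
    return reward if done else reward + gamma * next_value_estimate


rewards = [0.0, 0.0, 1.0]  # hypothetical three-step episode
gamma = 0.9

g0 = mc_return(rewards, gamma)

# Same transition, two different estimates of the next state's value.
# The true value of the next state in this episode is 0.9.
accurate = td_target(rewards[0], gamma, next_value_estimate=0.9, done=False)
inflated = td_target(rewards[0], gamma, next_value_estimate=5.0, done=False)

print(g0, accurate, inflated)
```

With an accurate next-state estimate the two targets agree; with an inflated one, the TD target passes the inflation back to the predecessor state, which is the loop that MC targets avoid.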

Positive feedback loop

  1. Overestimate Q(s, a) for some s.
  2. max_a Q(s′, a) inflates target at predecessor states.
  3. Gradient spreads overestimation via shared θ.
  4. Q values explode → argmax flips wildly → policy chaotic.
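```python
# Divergence symptom: monitor during training.
# Assumes `step` (int) and `q_values` (a tensor or array of predicted Q-values
# for the current batch) are defined by the surrounding training loop.
if step % 1000 == 0:
    max_q = abs(q_values).max().item()  # magnitude, so large negative values count too
    if max_q > 1e4:
        print("Warning: Q magnitude exploding", max_q)
```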
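
Steps 1 and 2 of the loop rest on a statistical fact: the max over noisy estimates is biased upward even when each estimate is unbiased. The sketch below is an illustration under stated assumptions, not part of the original lesson: every action has true value 0, estimation noise is Gaussian, and the names `n_actions`, `noise_std`, and `n_trials` are made up for the demo.

```python
import random

random.seed(0)

n_actions = 10
noise_std = 1.0
n_trials = 10_000

# Assumption for illustration: every action has true value 0,
# so the true max over actions is also 0.
total = 0.0
for _ in range(n_trials):
    noisy_q = [random.gauss(0.0, noise_std) for _ in range(n_actions)]
    total += max(noisy_q)

mean_max = total / n_trials
print(f"true max = 0.0, average max of noisy estimates = {mean_max:.2f}")
```

The average comes out well above zero (around 1.5 noise standard deviations for ten actions), so a target built with max_a Q(s′, a) is inflated before any learning dynamics are involved.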

Mitigations map

| Technique | Attacks |
|---|---|
| Experience replay | Correlation, non-i.i.d. samples |
| Target network | Moving target (slowly updated θ⁻) |
| Double DQN | Overestimation bias from the max operator |
| Gradient clipping | Parameter blow-up |
| Huber loss | Outlier TD errors |
| On-policy (PPO) | Off-policy leg of triad |
| Lower learning rate | All of the above (slows divergence) |

Module 4 covers the DQN stack in depth.
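
Two of the rows above, Double DQN and Huber loss, are small enough to sketch in plain Python. This is a hedged illustration on lists of floats rather than a real network; the function names and the example Q-values are hypothetical, and a practical implementation would operate on batched tensors.

```python
def dqn_target(reward, gamma, q_next_target, done):
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, gamma, q_next_online, q_next_target, done):
    """Double DQN target: online network selects the action, target network evaluates it."""
    if done:
        return reward
    best = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return reward + gamma * q_next_target[best]


def huber(td_error, delta=1.0):
    """Quadratic near zero, linear beyond delta, so outlier errors have bounded gradient."""
    a = abs(td_error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


# Hypothetical Q-values for the next state under each network.
q_next_online = [1.0, 3.0, 2.0]
q_next_target = [1.5, 0.5, 2.5]

standard = dqn_target(1.0, 0.9, q_next_target, done=False)
double = double_dqn_target(1.0, 0.9, q_next_online, q_next_target, done=False)
print(standard, double, huber(0.5), huber(10.0))
```

Decoupling action selection from evaluation means one network's noise spike has to be matched by the other's before it inflates the target, which is why the Double DQN target is never larger than the standard one for the same target-network values.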


Worked intuition: linear divergent example

Baird's counterexample, along with related examples from Boyan and Moore and from Tsitsiklis and Van Roy, shows that off-policy linear TD can diverge even in simple MDPs with innocent-looking features. Not every (φ, γ, α) triple is safe.

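| Setup | Outcome |
|---|---|
| On-policy linear TD(0) | Converges (under conditions on step sizes and features) |
| Off-policy linear TD | Can diverge |
| Tabular Q-learning | Converges (if every state-action pair keeps being visited) |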
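
A well-known two-state construction from the literature makes the "can diverge" row concrete. It is not necessarily the example this lesson has in mind, and the function name `off_policy_td_weights` is made up here. There is one weight w, features φ(s1) = 1 and φ(s2) = 2, reward 0, and updates are applied only to the transition s1 → s2. That restriction is the off-policy part: transitions out of s2, which would pull its value back down, never appear in the data.

```python
def off_policy_td_weights(gamma, alpha, w0=1.0, steps=50):
    """Repeat semi-gradient TD(0) on the single transition s1 -> s2 with reward 0.

    Features: phi(s1) = 1, phi(s2) = 2, so V(s1) = w and V(s2) = 2 * w.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        td_error = 0.0 + gamma * (2.0 * w) - w  # r + gamma * V(s2) - V(s1)
        w += alpha * td_error * 1.0             # gradient of V(s1) w.r.t. w is phi(s1) = 1
        history.append(w)
    return history


diverging = off_policy_td_weights(gamma=0.9, alpha=0.1)
shrinking = off_policy_td_weights(gamma=0.4, alpha=0.1)
print(diverging[-1], shrinking[-1])
```

Each update multiplies w by 1 + α(2γ − 1), so w grows without bound whenever γ > 0.5, no matter how small α is. A smaller step size only slows the blow-up.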

Diagnostics

| Signal | Likely issue |
|---|---|
| Q mean drifts upward forever | Overestimation, no target network |
| Loss down, eval return flat | Overfitting to replay, stale data |
| Sudden return collapse | Policy chasing a spurious Q peak |
| NaN weights | Learning rate α too high, no gradient clipping |

Checkpoint: You remove replay but keep off-policy Q-learning + neural net. What triad property worsens?

Answer

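None of the three triad ingredients changes: off-policy learning, function approximation, and bootstrapping are all still present. What worsens is sample correlation. Consecutive transitions are highly correlated, so successive updates push θ in the same direction and amplify bootstrap errors. Replay breaks that temporal correlation, so removing it makes instability more likely.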
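
What replay contributes can be seen in a minimal buffer sketch. This is an assumption-laden illustration rather than the course's implementation: the class name `ReplayBuffer`, its methods, and the placeholder transition fields are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions drop out first

    def push(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Sampling without replacement from the whole buffer mixes old and new data.
        return random.sample(list(self.storage), batch_size)


random.seed(0)
buffer = ReplayBuffer(capacity=1000)
for t in range(1000):
    buffer.push((t, "state", "action", 0.0, "next_state"))  # placeholder fields

batch = buffer.sample(32)
time_indices = sorted(tr[0] for tr in batch)
print(time_indices)
```

The sampled time indices are scattered across the whole history instead of forming one run of 32 consecutive steps, which is what training directly on the live trajectory would give.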


Design philosophy

| Priority | Choose |
|---|---|
| Stability first | PPO, careful on-policy training |
| Sample efficiency | SAC, DQN + full stabilizer kit |
| Simplicity / debugging | Tabular or linear FA |

Production teams log Q percentiles, TD error histograms, and policy entropy — not just return.
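
Two of those diagnostics can be computed with the standard library alone. The helpers below are a sketch under assumptions: the names `q_percentiles` and `policy_entropy` are hypothetical, Q-values arrive as a plain list of floats, and the policy is a discrete probability vector.

```python
import math
import statistics


def q_percentiles(q_values):
    """Return the 5th, 50th, and 95th percentiles of a list of Q-values."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(q_values, n=20, method="inclusive")
    return {"p5": cuts[0], "p50": cuts[9], "p95": cuts[18]}


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


summary = q_percentiles([float(q) for q in range(101)])
uniform_entropy = policy_entropy([0.25, 0.25, 0.25, 0.25])
greedy_entropy = policy_entropy([1.0, 0.0, 0.0, 0.0])
print(summary, uniform_entropy, greedy_entropy)
```

A p95 that keeps climbing while p50 stays put points at a few exploding Q-values, and entropy falling toward zero early in training suggests the policy has locked onto an action before the values are trustworthy.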


Common mistakes

  • Assuming that a falling loss implies a good policy.
  • Tuning only the learning rate when the replay buffer size and target-network sync interval are wrong.
  • Ignoring non-stationary data in off-policy FA.
  • Copying DQN hyperparameters from Atari to CartPole without rescaling them to the smaller problem.

You now know why deep RL needs engineering beyond the Bellman update. Module 4 builds DQN with experience replay and target networks — turning unstable naive Q-learning into a trainable system on CartPole and Atari.

