
Module 2 — Tabular methods

Temporal-difference learning

TD(0), bootstrapping vs MC, TD error, and n-step returns preview.

~65 min read + exercises

Before we begin

Temporal-difference (TD) learning combines sampling (like MC) with bootstrapping (like DP): update estimates from the next state’s value without waiting for episode end. The TD error measures surprise relative to the Bellman prediction.

TD(0) for V^π is the bridge from DP to modern deep RL.


What you will learn

  • Write the TD(0) update for V^π.
  • Define TD error δₜ.
  • Contrast TD with Monte Carlo on bias, variance, and update timing.
  • Implement online TD prediction on a random walk.
  • Preview TD(λ) and eligibility traces at a high level.

TD(0) prediction

After transition (s, r, s′):

V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]

The bracketed term is the TD error. In time-indexed notation, with s = sₜ, r = rₜ₊₁, and s′ = sₜ₊₁:

δₜ = rₜ₊₁ + γ V(sₜ₊₁) − V(sₜ)

| Term | Role |
|---|---|
| r + γ V(s′) | Target (one-step Bellman) |
| V(s) | Current estimate |
| α | Learning rate |

At terminal s′, V(s′) = 0 by convention, so the target reduces to r.


Worked numeric step

s = Start, r = 0, s′ = NearGoal, γ = 0.9, V(Start)=0, V(NearGoal)=5, α = 0.1.

Target = 0 + 0.9 × 5 = 4.5
δ = 4.5 − 0 = 4.5
V(Start) ← 0 + 0.1 × 4.5 = 0.45

One step moved V(Start) a fraction α = 0.1 of the way toward the target, that is, toward Bellman consistency.

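```python
def td0_step(V, s, r, s_next, gamma, alpha, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # A terminal next state contributes no bootstrapped value.
    target = r + (0.0 if terminal else gamma * V[s_next])
    delta = target - V[s]      # TD error
    V[s] += alpha * delta      # move V(s) a fraction alpha toward the target
    return delta
```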
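As a quick check, the worked numeric step can be reproduced in code. This is a minimal sketch: the dictionary-based value table and the function name `td0_update` are illustrative choices made for this example, not part of the lesson's own code.

```python
def td0_update(V, s, r, s_next, gamma, alpha, terminal=False):
    """Apply one TD(0) update in place and return the TD error."""
    bootstrap = 0.0 if terminal else gamma * V[s_next]
    delta = r + bootstrap - V[s]
    V[s] += alpha * delta
    return delta


# Values taken from the worked numeric step.
V = {"Start": 0.0, "NearGoal": 5.0}
delta = td0_update(V, "Start", r=0.0, s_next="NearGoal", gamma=0.9, alpha=0.1)

print(round(delta, 6))       # 4.5  (TD error)
print(round(V["Start"], 6))  # 0.45 (updated estimate)
```

Only V(Start) changes; V(NearGoal) is read for the target but is not updated by this transition.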

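Checkpoint: Compare MC and TD after one step in the middle of an episode. Which has seen the final reward?

Answer

Neither has seen the final reward. MC has not updated yet, because it waits for the episode to end and uses the full return. TD has already updated by bootstrapping from V(s′), which may be inaccurate early in learning. That bootstrapping introduces bias, but the one-step target has lower variance than the full return.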


Bias–variance trade-off

| Method | Target uses | Bias | Variance |
|---|---|---|---|
| MC | Full Gₜ | Low | High |
| TD(0) | r + γ V(s′) | Some | Lower |
| DP | Full Bellman sum | None | N/A (no samples) |

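TD bias comes from using the current estimate V(s′) in place of V^π(s′). As visits grow and the step size is suitably small or decaying, V → V^π and the TD bias fades.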
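The trade-off in the table can be observed by sampling both kinds of target from a single state. The sketch below makes several assumptions that are not fixed by the lesson: a 7-state random walk (states 0 to 6, both ends terminal), exit rewards of −1 on the left and +1 on the right, γ = 1, and state 4 as the state under study. It compares the full return with the one-step target computed from the true values and from an all-zero estimate.

```python
import random
import statistics

N_STATES = 7                     # states 0..6; 0 and 6 are terminal
LEFT, RIGHT = 0, N_STATES - 1
GAMMA = 1.0                      # assumed: undiscounted


def step(s, rng):
    """Move left or right with equal probability; reward only on exit."""
    s_next = s + rng.choice((-1, 1))
    if s_next == LEFT:
        return s_next, -1.0, True    # assumed exit reward
    if s_next == RIGHT:
        return s_next, 1.0, True     # assumed exit reward
    return s_next, 0.0, False


def mc_return(s, rng):
    """Sample the full return G from state s by running to termination."""
    g, discount, done = 0.0, 1.0, False
    while not done:
        s, r, done = step(s, rng)
        g += discount * r
        discount *= GAMMA
    return g


def td_target(s, V, rng):
    """Sample the one-step target r + gamma * V(s') from state s."""
    s_next, r, done = step(s, rng)
    return r + (0.0 if done else GAMMA * V[s_next])


rng = random.Random(0)
s = 4
# True values under the assumed rewards: (i - 3) / 3 for interior state i.
true_V = [0.0] + [(i - 3) / 3 for i in range(1, 6)] + [0.0]
zero_V = [0.0] * N_STATES        # a deliberately poor early estimate

mc = [mc_return(s, rng) for _ in range(20000)]
td_good = [td_target(s, true_V, rng) for _ in range(20000)]
td_bad = [td_target(s, zero_V, rng) for _ in range(20000)]

for name, xs in [("MC", mc), ("TD, true V", td_good), ("TD, zero V", td_bad)]:
    print(f"{name:11s} mean={statistics.fmean(xs):+.3f} "
          f"var={statistics.pvariance(xs):.3f}")
```

Under these assumptions the analytic values are a mean of 1/3 and variance of 8/9 for the MC return, and a mean of 1/3 and variance of 1/9 for the TD target built from the true values. With the all-zero estimate the TD target from state 4 is always 0: zero variance, but biased by 1/3. The printed sample statistics should land close to these figures.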


Random walk example

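The walk has 7 states, numbered 0 to 6. States 0 and 6 are terminal with V = 0. The middle (non-terminal) states start at V = 0.5. Each step moves +1 or −1 with equal probability, and r = 0 except on the step that exits. True V(mid) = 0, which holds by symmetry when the two exit rewards cancel (for example −1 on the left and +1 on the right).

Run many episodes with α = 0.1 and the estimate for the middle state moves toward 0. TD updates online after every step; MC updates only at episode ends, so its estimates jump there.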
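A minimal sketch of online TD(0) prediction on this walk follows. The lesson does not state the exit rewards, the discount, or the start state, so the code assumes −1 on the left exit, +1 on the right exit, γ = 1, and episodes that start in state 3. The function name `td0_random_walk` is illustrative.

```python
import random


def td0_random_walk(episodes=1000, alpha=0.1, gamma=1.0, seed=0):
    """Online TD(0) prediction on a 7-state random walk (states 0..6)."""
    rng = random.Random(seed)
    n = 7
    V = [0.0] + [0.5] * (n - 2) + [0.0]      # terminals stay at 0
    for _ in range(episodes):
        s = n // 2                            # assumed: start in state 3
        while True:
            s_next = s + rng.choice((-1, 1))
            terminal = s_next in (0, n - 1)
            # Assumed exit rewards: -1 on the left, +1 on the right.
            r = -1.0 if s_next == 0 else 1.0 if s_next == n - 1 else 0.0
            target = r + (0.0 if terminal else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # update immediately, mid-episode
            if terminal:
                break
            s = s_next
    return V


V = td0_random_walk()
print([round(v, 2) for v in V])
```

Under the assumed rewards the true interior values are −2/3, −1/3, 0, 1/3, and 2/3. Because α is held constant, the estimates keep fluctuating around those values rather than settling exactly; a decaying step size would reduce the fluctuation.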


TD control: SARSA preview

TD extends to Q^π with actions — SARSA (next lesson) uses:

Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]

The update bootstraps from a′, the action the policy actually takes in s′, which is what makes SARSA on-policy.
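The SARSA update has the same shape as the TD(0) update, with Q(s, a) in place of V(s). A minimal sketch, assuming a dictionary keyed by (state, action) pairs; the function name `sarsa_update` and the state and action labels are hypothetical:

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, gamma, alpha, terminal=False):
    """One SARSA update on a dict-like Q keyed by (state, action)."""
    bootstrap = 0.0 if terminal else gamma * Q[(s_next, a_next)]
    delta = r + bootstrap - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


Q = defaultdict(float)           # unseen pairs default to 0.0
Q[("s1", "right")] = 2.0
delta = sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="right",
                     gamma=0.9, alpha=0.5)
print(round(delta, 6), round(Q[("s0", "right")], 6))  # 2.8 1.4
```

Note that `a_next` must be chosen by the behaviour policy before the update is applied, since the target depends on it.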


TD(λ) in one paragraph

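Eligibility traces e(s) record how recently each state was visited, so each TD error also updates recently visited states, with credit decaying by γλ per step. λ = 0 recovers TD(0); λ = 1 gives MC-like updates based on the full return. Traces are useful when rewards are delayed, and λ interpolates between TD and MC.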
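To make the trace mechanism concrete, here is a sketch of backward-view TD(λ) with accumulating traces over a single episode. The transition format `(s, r, s_next, terminal)` and the two-state chain A → B → terminal are hypothetical, chosen only so that the effect of λ is easy to follow by hand.

```python
def td_lambda_episode(V, transitions, gamma, alpha, lam):
    """Backward-view TD(lambda) with accumulating traces over one episode.

    transitions: list of (s, r, s_next, terminal) tuples (hypothetical format).
    """
    e = {s: 0.0 for s in V}                  # eligibility trace per state
    for s, r, s_next, terminal in transitions:
        delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]
        e[s] += 1.0                           # mark the state just visited
        for x in V:
            V[x] += alpha * delta * e[x]      # credit recently visited states
            e[x] *= gamma * lam               # decay every trace
    return V


# A -> B -> terminal, with reward 1 only on the final step.
episode = [("A", 0.0, "B", False), ("B", 1.0, "T", True)]

V0 = td_lambda_episode({"A": 0.0, "B": 0.0}, episode, gamma=1.0, alpha=0.5, lam=0.0)
V1 = td_lambda_episode({"A": 0.0, "B": 0.0}, episode, gamma=1.0, alpha=0.5, lam=1.0)
print(V0)  # {'A': 0.0, 'B': 0.5}
print(V1)  # {'A': 0.5, 'B': 0.5}
```

With λ = 0 only B, the state just before the reward, is updated. With λ = 1 the trace on A is still fully active when the reward arrives, so A receives the same credit, as it would under an MC update.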


Common mistakes

  • α too large — oscillation or divergence.
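  • Confusing truncation with termination: when the episode truly ended, the target must not bootstrap; when it was only cut off by a time limit, the target should still bootstrap from V(s′).
  • Updating V(s′) for a true terminal s′, whose value should stay at 0.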
  • Comparing MC and TD curves at different sample counts — TD gets more updates per episode.
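The truncation and termination cases in the list above can be kept apart with two separate flags. The sketch below assumes the environment loop supplies `terminated` and `truncated` booleans; those flag names and the function name `td0_update_with_flags` are hypothetical choices for this example.

```python
def td0_update_with_flags(V, s, r, s_next, gamma, alpha, terminated, truncated):
    """TD(0) update that separates termination from truncation.

    Returns True when the episode loop should stop.
    """
    # Bootstrap unless the episode truly terminated; a time-limit
    # truncation still uses the estimate of the next state.
    v_next = 0.0 if terminated else V[s_next]
    target = r + gamma * v_next
    V[s] += alpha * (target - V[s])
    return terminated or truncated


# Same transition, handled as a truncation and as a true termination.
V_trunc = {"a": 0.0, "b": 2.0}
td0_update_with_flags(V_trunc, "a", 1.0, "b", gamma=0.9, alpha=0.5,
                      terminated=False, truncated=True)

V_term = {"a": 0.0, "b": 2.0}
td0_update_with_flags(V_term, "a", 1.0, "b", gamma=0.9, alpha=0.5,
                      terminated=True, truncated=False)

print(round(V_trunc["a"], 6))  # 1.4: target was 1 + 0.9 * 2
print(round(V_term["a"], 6))   # 0.5: target was 1
```

Both calls end the episode, but only the truncated one uses V(s′) in its target. Treating the truncated case as terminal would pull V(a) toward 1 instead of 2.8.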

TD learning enables step-by-step control algorithms. Next: Q-learning and SARSA — off-policy vs on-policy TD control.

