Temporal-difference learning
Before we begin
Temporal-difference (TD) learning combines sampling (like MC) with bootstrapping (like DP): update estimates from the next state’s value without waiting for episode end. The TD error measures surprise relative to the Bellman prediction.
TD(0) for V^π is the bridge from DP to modern deep RL.
What you will learn
- Write the TD(0) update for V^π.
- Define TD error δₜ.
- Contrast TD with Monte Carlo on bias, variance, and update timing.
- Implement online TD prediction on a random walk.
- Preview TD(λ) and eligibility traces at a high level.
TD(0) prediction
After transition (s, r, s′):
V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]
Bracket term is TD error:
δₜ = rₜ₊₁ + γ V(sₜ₊₁) − V(sₜ)
| Term | Role |
|---|---|
| r + γ V(s′) | Target (one-step Bellman) |
| V(s) | Current estimate |
| α | Learning rate |
At terminal s′, V(s′) = 0 by convention.
Worked numeric step
s = Start, r = 0, s′ = NearGoal, γ = 0.9, V(Start)=0, V(NearGoal)=5, α = 0.1.
Target = 0 + 0.9 × 5 = 4.5
δ = 4.5 − 0 = 4.5
V(Start) ← 0 + 0.1 × 4.5 = 0.45
One step moved value toward Bellman consistency.
def td0_step(V, s, r, s_next, gamma, alpha, terminal=False):
target = r + (0 if terminal else gamma * V[s_next])
delta = target - V[s]
V[s] += alpha * delta
return deltaCheckpoint: MC vs TD after one step in the middle of an episode — which has seen the final reward?
Answer
MC has not updated yet (waits for episode end). TD already bootstrapped from V(s′), which may be wrong early — bias but lower variance.
Bias–variance trade-off
| Method | Target uses | Bias | Variance |
|---|---|---|---|
| MC | Full Gₜ | Low | High |
| TD(0) | r + γ V(s′) | Some | Lower |
| DP | Full Bellman sum | None | N/A (no samples) |
As visits grow, V → V^π and TD bias fades.
Random walk example
7 states, ends 0 and 6 terminal with V=0. Middle starts V=0.5. Each step +1 or −1 with equal prob, r=0 except exit. True V(mid)=0.
Run many episodes with α = 0.1 — V curves toward 0. TD learns online; MC jumps at episode ends.
TD control: SARSA preview
TD extends to Q^π with actions — SARSA (next lesson) uses:
Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]
Bootstraps from actual next action a′ the policy will take.
TD(λ) in one paragraph
Eligibility traces e(s) credit recent states for TD errors. λ=0 → TD(0); λ=1 → MC-like. Useful when rewards are delayed — bridges MC and TD.
Common mistakes
- α too large — oscillation or divergence.
- Bootstrapping from non-terminal when episode actually ended (truncation vs termination).
- Updating V(s′) when s′ should not bootstrap (true terminal).
- Comparing MC and TD curves at different sample counts — TD gets more updates per episode.
TD learning enables step-by-step control algorithms. Next: Q-learning and SARSA — off-policy vs on-policy TD control.