Temporal-difference learning

Before we begin

Temporal-difference (TD) learning combines sampling (like MC) with bootstrapping (like DP): update estimates from the next state’s value without waiting for episode end. The TD error measures surprise relative to the Bellman prediction.

TD(0) for V^π is the bridge from DP to modern deep RL.

What you will learn

Write the TD(0) update for V^π.
Define TD error δₜ.
Contrast TD with Monte Carlo on bias, variance, and update timing.
Implement online TD prediction on a random walk.
Preview TD(λ) and eligibility traces at a high level.

TD(0) prediction

After transition (s, r, s′):

V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]

Bracket term is TD error:

δₜ = rₜ₊₁ + γ V(sₜ₊₁) − V(sₜ)

Term	Role
r + γ V(s′)	Target (one-step Bellman)
V(s)	Current estimate
α	Learning rate

At terminal s′, V(s′) = 0 by convention.

Worked numeric step

s = Start, r = 0, s′ = NearGoal, γ = 0.9, V(Start)=0, V(NearGoal)=5, α = 0.1.

Target = 0 + 0.9 × 5 = 4.5
δ = 4.5 − 0 = 4.5
V(Start) ← 0 + 0.1 × 4.5 = 0.45

One step moved value toward Bellman consistency.

python

def td0_step(V, s, r, s_next, gamma, alpha, terminal=False):
    target = r + (0 if terminal else gamma * V[s_next])
    delta = target - V[s]
    V[s] += alpha * delta
    return delta

Checkpoint: MC vs TD after one step in the middle of an episode — which has seen the final reward?

Answer

MC has not updated yet (waits for episode end). TD already bootstrapped from V(s′), which may be wrong early — bias but lower variance.

Bias–variance trade-off

Method	Target uses	Bias	Variance
MC	Full Gₜ	Low	High
TD(0)	r + γ V(s′)	Some	Lower
DP	Full Bellman sum	None	N/A (no samples)

As visits grow, V → V^π and TD bias fades.

Random walk example

7 states, ends 0 and 6 terminal with V=0. Middle starts V=0.5. Each step +1 or −1 with equal prob, r=0 except exit. True V(mid)=0.

Run many episodes with α = 0.1 — V curves toward 0. TD learns online; MC jumps at episode ends.

TD control: SARSA preview

TD extends to Q^π with actions — SARSA (next lesson) uses:

Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]

Bootstraps from actual next action a′ the policy will take.

TD(λ) in one paragraph

Eligibility traces e(s) credit recent states for TD errors. λ=0 → TD(0); λ=1 → MC-like. Useful when rewards are delayed — bridges MC and TD.

Common mistakes

α too large — oscillation or divergence.
Bootstrapping from non-terminal when episode actually ended (truncation vs termination).
Updating V(s′) when s′ should not bootstrap (true terminal).
Comparing MC and TD curves at different sample counts — TD gets more updates per episode.

TD learning enables step-by-step control algorithms. Next: Q-learning and SARSA — off-policy vs on-policy TD control.

Before this lesson

Previous lesson

What's next

Next lesson — Q-learning & SARSA