← Back to curriculum

Module 3 — Function approximation

Linear function approximation

Feature vectors, linear V and Q, semi-gradient TD, and stability caveats.

~65 min read + exercises

Linear function approximation

Before we begin

Linear function approximation represents value or Q as a dot product of features φ(s) and weights w. Updates are often convex (for linear TD with appropriate targets) and interpretable — you can inspect which features drive value.

Many production systems used linear FA for years; understanding it clarifies what neural nets replace and what can still go wrong.


What you will learn

  • Write V̂(s) = wᵀ φ(s) and Q̂(s,a) = w_aᵀ φ(s).
  • Apply semi-gradient TD updates for value prediction.
  • Implement linear Q-learning with tile coding intuition.
  • Compare batch vs stochastic gradient updates.
  • Know when linear models suffice vs need deep nets.

Linear value function

V̂(s; w) = Σᵢ wᵢ φᵢ(s) = wᵀ φ(s)

Features φ(s) can be raw obs, polynomials, Fourier basis, tile coding, RBFs.

Feature choiceProsCons
Raw stateSimpleMay not generalize
PolynomialsSmooth functionsCurse of dimensionality
Tile codingCoarse generalizationHand-tuned bins

Semi-gradient TD(0) update

Target y = r + γ V̂(s′; w) — treat target as fixed for gradient:

w ← w + α [ y − V̂(s; w) ] ∇_w V̂(s; w)

For linear V̂, ∇_w V̂(s) = φ(s):

w ← w + α δ φ(s), δ = r + γ wᵀφ(s′) − wᵀφ(s)

python
import numpy as np
 
def linear_td0(w, phi_s, phi_s_next, r, gamma, alpha, terminal=False):
    v = w @ phi_s
    v_next = 0.0 if terminal else w @ phi_s_next
    delta = r + gamma * v_next - v
    w += alpha * delta * phi_s
    return w, delta

Not true gradient descent on a fixed loss — semi-gradient because target depends on w.


Linear Q-learning

Separate w_a per action:

Q̂(s, a) = w_aᵀ φ(s)

Update for taken action a:

w_a ← w_a + α [ r + γ max_a′ w_a′ᵀ φ(s′) − w_aᵀ φ(s) ] φ(s)

python
def linear_q_step(W, phi_s, phi_s_next, s_idx, a, r, gamma, alpha, terminal):
    q = W[a] @ phi_s
    if terminal:
        target = r
    else:
        target = r + gamma * np.max(W @ phi_s_next)
    W[a] += alpha * (target - q) * phi_s

Worked numeric step

φ(s) = [1, x], w = [0, 0], r=1, φ(s′)=[1, 2], γ=0.9, α=0.5, non-terminal.

V̂(s)=0, target = 1 + 0.9×0 = 1, δ=1
w ← [0,0] + 0.5×1×[1,x] — depends on x feature of s.


Tile coding intuition

Split each dimension into intervals; tiles are overlapping hyper-rectangles. φ(s) is binary vector of active tiles — sparse, generalizes locally.

CartPole with angle bins: nearby angles share tiles → similar Q.


Batch methods (preview)

LSTD, least-squares TD solve for w in one shot from batch data — sample efficient but O(d²) memory. Useful when d is modest.


Common mistakes

  • Features not normalized — huge x dominates dot product.
  • Using same w for all actions when Q(s,a) differs sharply per action.
  • Forgetting terminal: no γ V̂(s′) at episode end.
  • Assuming semi-gradient always converges — off-policy linear FA can diverge (next lesson).

Linear FA scales to moderate problems and CartPole with good features. Neural networks replace hand-crafted φ with learned representations — same update shape, far more capacity, new instability modes.


Before this lesson


What's next