Linear function approximation

Before we begin

Linear function approximation represents value or Q as a dot product of features φ(s) and weights w. Updates are often convex (for linear TD with appropriate targets) and interpretable — you can inspect which features drive value.

Many production systems used linear FA for years; understanding it clarifies what neural nets replace and what can still go wrong.

What you will learn

Write V̂(s) = wᵀ φ(s) and Q̂(s,a) = w_aᵀ φ(s).
Apply semi-gradient TD updates for value prediction.
Implement linear Q-learning with tile coding intuition.
Compare batch vs stochastic gradient updates.
Know when linear models suffice vs need deep nets.

Linear value function

V̂(s; w) = Σᵢ wᵢ φᵢ(s) = wᵀ φ(s)

Features φ(s) can be raw obs, polynomials, Fourier basis, tile coding, RBFs.

Feature choice	Pros	Cons
Raw state	Simple	May not generalize
Polynomials	Smooth functions	Curse of dimensionality
Tile coding	Coarse generalization	Hand-tuned bins

Semi-gradient TD(0) update

Target y = r + γ V̂(s′; w) — treat target as fixed for gradient:

w ← w + α [ y − V̂(s; w) ] ∇_w V̂(s; w)

For linear V̂, ∇_w V̂(s) = φ(s):

w ← w + α δ φ(s), δ = r + γ wᵀφ(s′) − wᵀφ(s)

python

import numpy as np
 
def linear_td0(w, phi_s, phi_s_next, r, gamma, alpha, terminal=False):
    v = w @ phi_s
    v_next = 0.0 if terminal else w @ phi_s_next
    delta = r + gamma * v_next - v
    w += alpha * delta * phi_s
    return w, delta

Not true gradient descent on a fixed loss — semi-gradient because target depends on w.

Linear Q-learning

Separate w_a per action:

Q̂(s, a) = w_aᵀ φ(s)

Update for taken action a:

w_a ← w_a + α [ r + γ max_a′ w_a′ᵀ φ(s′) − w_aᵀ φ(s) ] φ(s)

python

def linear_q_step(W, phi_s, phi_s_next, s_idx, a, r, gamma, alpha, terminal):
    q = W[a] @ phi_s
    if terminal:
        target = r
    else:
        target = r + gamma * np.max(W @ phi_s_next)
    W[a] += alpha * (target - q) * phi_s

Worked numeric step

φ(s) = [1, x], w = [0, 0], r=1, φ(s′)=[1, 2], γ=0.9, α=0.5, non-terminal.

V̂(s)=0, target = 1 + 0.9×0 = 1, δ=1
w ← [0,0] + 0.5×1×[1,x] — depends on x feature of s.

Tile coding intuition

Split each dimension into intervals; tiles are overlapping hyper-rectangles. φ(s) is binary vector of active tiles — sparse, generalizes locally.

CartPole with angle bins: nearby angles share tiles → similar Q.

Batch methods (preview)

LSTD, least-squares TD solve for w in one shot from batch data — sample efficient but O(d²) memory. Useful when d is modest.

Common mistakes

Features not normalized — huge x dominates dot product.
Using same w for all actions when Q(s,a) differs sharply per action.
Forgetting terminal: no γ V̂(s′) at episode end.
Assuming semi-gradient always converges — off-policy linear FA can diverge (next lesson).

Linear FA scales to moderate problems and CartPole with good features. Neural networks replace hand-crafted φ with learned representations — same update shape, far more capacity, new instability modes.

Before this lesson

Previous lesson

What's next

Next lesson — Neural networks as approximators