Linear function approximation
Before we begin
Linear function approximation represents value or Q as a dot product of features φ(s) and weights w. Updates are often convex (for linear TD with appropriate targets) and interpretable — you can inspect which features drive value.
Many production systems used linear FA for years; understanding it clarifies what neural nets replace and what can still go wrong.
What you will learn
- Write V̂(s) = wᵀ φ(s) and Q̂(s,a) = w_aᵀ φ(s).
- Apply semi-gradient TD updates for value prediction.
- Implement linear Q-learning with tile coding intuition.
- Compare batch vs stochastic gradient updates.
- Know when linear models suffice vs need deep nets.
Linear value function
V̂(s; w) = Σᵢ wᵢ φᵢ(s) = wᵀ φ(s)
Features φ(s) can be raw obs, polynomials, Fourier basis, tile coding, RBFs.
| Feature choice | Pros | Cons |
|---|---|---|
| Raw state | Simple | May not generalize |
| Polynomials | Smooth functions | Curse of dimensionality |
| Tile coding | Coarse generalization | Hand-tuned bins |
Semi-gradient TD(0) update
Target y = r + γ V̂(s′; w) — treat target as fixed for gradient:
w ← w + α [ y − V̂(s; w) ] ∇_w V̂(s; w)
For linear V̂, ∇_w V̂(s) = φ(s):
w ← w + α δ φ(s), δ = r + γ wᵀφ(s′) − wᵀφ(s)
import numpy as np
def linear_td0(w, phi_s, phi_s_next, r, gamma, alpha, terminal=False):
v = w @ phi_s
v_next = 0.0 if terminal else w @ phi_s_next
delta = r + gamma * v_next - v
w += alpha * delta * phi_s
return w, deltaNot true gradient descent on a fixed loss — semi-gradient because target depends on w.
Linear Q-learning
Separate w_a per action:
Q̂(s, a) = w_aᵀ φ(s)
Update for taken action a:
w_a ← w_a + α [ r + γ max_a′ w_a′ᵀ φ(s′) − w_aᵀ φ(s) ] φ(s)
def linear_q_step(W, phi_s, phi_s_next, s_idx, a, r, gamma, alpha, terminal):
q = W[a] @ phi_s
if terminal:
target = r
else:
target = r + gamma * np.max(W @ phi_s_next)
W[a] += alpha * (target - q) * phi_sWorked numeric step
φ(s) = [1, x], w = [0, 0], r=1, φ(s′)=[1, 2], γ=0.9, α=0.5, non-terminal.
V̂(s)=0, target = 1 + 0.9×0 = 1, δ=1
w ← [0,0] + 0.5×1×[1,x] — depends on x feature of s.
Tile coding intuition
Split each dimension into intervals; tiles are overlapping hyper-rectangles. φ(s) is binary vector of active tiles — sparse, generalizes locally.
CartPole with angle bins: nearby angles share tiles → similar Q.
Batch methods (preview)
LSTD, least-squares TD solve for w in one shot from batch data — sample efficient but O(d²) memory. Useful when d is modest.
Common mistakes
- Features not normalized — huge x dominates dot product.
- Using same w for all actions when Q(s,a) differs sharply per action.
- Forgetting terminal: no γ V̂(s′) at episode end.
- Assuming semi-gradient always converges — off-policy linear FA can diverge (next lesson).
Linear FA scales to moderate problems and CartPole with good features. Neural networks replace hand-crafted φ with learned representations — same update shape, far more capacity, new instability modes.