Project: CartPole with linear FA
Before we begin
Solve CartPole-v1 using linear function approximation — hand-crafted features and a linear model for action values or state values. This bridges tabular Q-learning (Module 2) and neural DQN (Module 4) without jumping straight to deep nets.
How this connects to Module 3
| Lesson | Where you use it |
|---|---|
| Why tabular breaks | CartPole observations are four continuous values, so states cannot be enumerated in a table |
| Linear FA | Q̂(s,a) = wᵀ φ(s,a) or V̂(s) = wᵀ φ(s) |
| Neural nets as approximators | Preview: same structure, nonlinear φ |
| Deadly triad | Linear Q-learning combines function approximation, bootstrapping, and off-policy targets, so keep step sizes small and watch for divergence |
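For the linear form in the "Linear FA" row, the gradient of the estimate with respect to the weights is just the feature vector, which is what makes the update cheap. A sketch of the semi-gradient Q-learning update in that notation follows; the symbols α (step size), γ (discount), and δ (TD error) are standard notation assumed here, matching the `alpha` and `gamma` parameters used later in the project.

```latex
\hat{Q}(s,a) = w^\top \phi(s,a)
\quad\Longrightarrow\quad
\nabla_w \hat{Q}(s,a) = \phi(s,a)

\delta = r + \gamma \max_{a'} \hat{Q}(s',a') - \hat{Q}(s,a)
\qquad (\text{with } \hat{Q}(s',\cdot) = 0 \text{ if } s' \text{ is terminal})

w \leftarrow w + \alpha\,\delta\,\phi(s,a)
```

It is called "semi-gradient" because the target r + γ max Q̂(s',a') also depends on w, but that dependence is ignored when differentiating.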
What you will build
| Piece | Purpose |
|---|---|
| Feature map φ(s) | Polynomial / tile features over (x, ẋ, θ, θ̇) |
| Linear Q-learning or SARSA(λ) with FA | Weight vector w updated each step |
| Learning curve | Episode return vs episode |
Folder layout:
    cartpole-linear-fa/
        features.py       # φ(s) and φ(s,a)
        linear_agent.py
        train.py
        outputs/returns.png

Estimated time: 3–5 hours.
Before you start
- Finish the Module 3 quiz.
- Install the dependencies:

    pip install gymnasium numpy matplotlib
Step 1 — Features for CartPole
CartPole observation: [cart_pos, cart_vel, pole_angle, pole_vel]. Build expanded features:
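    import numpy as np

    def phi(state: np.ndarray) -> np.ndarray:
        """Map a CartPole observation to a 10-dimensional feature vector."""
        x, x_dot, theta, theta_dot = state
        return np.array([
            1.0, x, x_dot, theta, theta_dot,   # bias + raw observation
            x**2, theta**2, x * theta,         # second-order terms
            np.sin(theta), np.cos(theta),      # trigonometric terms of the pole angle
        ])

For semi-gradient Q-learning with 2 actions, use phi(s, a): place φ(s) in the block of a zero vector selected by the action (one block per action), or, equivalently, maintain a separate weight vector per action. Step 2 uses the second form.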
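The two options for action-dependent features can be sketched side by side. The block below is a minimal illustration, not the course's reference code: `phi_sa` and `N_ACTIONS` are hypothetical names, and `phi` is repeated only to keep the example self-contained.

```python
import numpy as np

N_ACTIONS = 2  # CartPole: push left / push right


def phi(state):
    """State features, repeated from Step 1 so this sketch runs on its own."""
    x, x_dot, theta, theta_dot = state
    return np.array([
        1.0, x, x_dot, theta, theta_dot,
        x**2, theta**2, x * theta,
        np.sin(theta), np.cos(theta),
    ])


def phi_sa(state, action, n_actions=N_ACTIONS):
    """Hypothetical helper: copy phi(s) into the block that belongs to `action`."""
    f = phi(state)
    out = np.zeros(n_actions * f.size)
    out[action * f.size:(action + 1) * f.size] = f
    return out


# One flat weight vector over phi(s, a) gives the same Q-values as
# one weight row per action applied to phi(s).
rng = np.random.default_rng(0)
W = rng.normal(size=(N_ACTIONS, 10))           # per-action weights
s = np.array([0.1, -0.2, 0.03, 0.4])           # arbitrary example state
flat = [W.reshape(-1) @ phi_sa(s, a) for a in range(N_ACTIONS)]
per_action = [W[a] @ phi(s) for a in range(N_ACTIONS)]
```

Because `flat` and `per_action` agree, the choice between the two layouts is a matter of convenience; the per-action matrix avoids building a mostly-zero vector on every step.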
Step 2 — Linear Q-learning update
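    import numpy as np

    from features import phi  # assumes the folder layout above

    class LinearQLearning:
        """Semi-gradient Q-learning with one weight vector per action."""

        def __init__(self, n_features, n_actions, alpha=1e-4, gamma=0.99, epsilon=0.1):
            self.w = np.zeros((n_actions, n_features))  # row a holds the weights for action a
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
            self.n_actions = n_actions

        def q_values(self, state):
            f = phi(state)
            return self.w @ f  # one Q-value per action

        def act(self, state):
            # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
            if np.random.random() < self.epsilon:
                return np.random.randint(self.n_actions)
            return int(np.argmax(self.q_values(state)))

        def learn(self, s, a, r, s_next, done):
            f = phi(s)
            q_sa = self.w[a] @ f
            # Do not bootstrap from a terminal state.
            q_next = 0.0 if done else np.max(self.q_values(s_next))
            td_error = r + self.gamma * q_next - q_sa
            # Semi-gradient step: the gradient of w[a] @ f with respect to w[a] is f.
            self.w[a] += self.alpha * td_error * f

Tune alpha: linear FA is sensitive to the step size (try 1e-5 … 1e-3).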
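The "What you will build" table also lists SARSA(λ) as an option. A minimal sketch of semi-gradient SARSA(λ) with accumulating eligibility traces is shown below, using the same per-action weight layout. The class name `LinearSarsaLambda`, the `feature_fn` argument, the `start_episode` method, and the default `lam=0.9` are all assumptions made for illustration and are not part of the course code.

```python
import numpy as np


class LinearSarsaLambda:
    """Sketch: semi-gradient SARSA(lambda) with accumulating traces."""

    def __init__(self, feature_fn, n_features, n_actions,
                 alpha=1e-4, gamma=0.99, lam=0.9, epsilon=0.1, seed=0):
        self.feature_fn = feature_fn            # hypothetical: maps a state to a feature vector
        self.w = np.zeros((n_actions, n_features))
        self.z = np.zeros_like(self.w)          # eligibility traces, same shape as w
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.epsilon = epsilon
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)

    def act(self, state):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(self.w @ self.feature_fn(state)))

    def start_episode(self):
        """Reset traces; call once after every env.reset()."""
        self.z[:] = 0.0

    def learn(self, s, a, r, s_next, a_next, terminated):
        f = self.feature_fn(s)
        q_sa = self.w[a] @ f
        # On-policy target: uses the action actually chosen in s_next.
        q_next = 0.0 if terminated else self.w[a_next] @ self.feature_fn(s_next)
        td_error = r + self.gamma * q_next - q_sa
        self.z *= self.gamma * self.lam         # decay every trace
        self.z[a] += f                          # accumulate the trace for the taken action
        self.w += self.alpha * td_error * self.z
```

Unlike Q-learning, the training loop has to pick `a_next` with `act(s_next)` before calling `learn`, then reuse that same action on the next step. Because the target is on-policy, this variant drops the off-policy leg of the deadly triad.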
Step 3 — Training
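    import gymnasium as gym

    from linear_agent import LinearQLearning  # assumes the folder layout above

    env = gym.make("CartPole-v1")
    agent = LinearQLearning(n_features=10, n_actions=2)  # 10 = len(phi(s))
    returns = []
    for ep in range(1000):
        s, _ = env.reset(seed=ep)
        total = 0.0
        done = False
        while not done:
            a = agent.act(s)
            s_next, r, term, trunc, _ = env.step(a)
            done = term or trunc
            # Pass `term`, not `done`: a time-limit truncation is not a terminal
            # state, so the target should still bootstrap from s_next.
            agent.learn(s, a, r, s_next, term)
            s = s_next
            total += r
        returns.append(total)
        agent.epsilon = max(0.01, agent.epsilon * 0.995)  # decay exploration per episode

Success criteria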
| Criterion | Target |
|---|---|
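| Mean return over last 100 episodes ≥ 195 | The CartPole-v0 "solved" threshold, used here as the project target (CartPole-v1 episodes run up to 500 steps and its registered threshold is 475) |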
| README notes which features mattered most | Required |
| Compare random baseline vs trained | Required |
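The first and third criteria can be checked with a few lines of NumPy. The sketch below assumes `returns` is the per-episode list collected in Step 3; the helper names `rolling_mean`, `is_solved`, and `random_baseline` are hypothetical, and `env` is any Gymnasium-style environment passed in by the caller.

```python
import numpy as np


def rolling_mean(returns, window=100):
    """Mean return over each run of `window` consecutive episodes."""
    returns = np.asarray(returns, dtype=float)
    if returns.size < window:
        return np.array([])
    return np.convolve(returns, np.ones(window) / window, mode="valid")


def is_solved(returns, threshold=195.0, window=100):
    """True if the mean over the last `window` episodes reaches the target."""
    means = rolling_mean(returns, window)
    return means.size > 0 and bool(means[-1] >= threshold)


def random_baseline(env, episodes=100, seed=0):
    """Mean return of a uniformly random policy on `env`."""
    env.action_space.seed(seed)
    totals = []
    for ep in range(episodes):
        env.reset(seed=seed + ep)
        total, done = 0.0, False
        while not done:
            _, r, term, trunc, _ = env.step(env.action_space.sample())
            total += r
            done = term or trunc
        totals.append(total)
    return float(np.mean(totals))
```

Plotting `rolling_mean(returns)` alongside the raw returns with matplotlib and saving the figure to outputs/returns.png gives the learning curve, and `random_baseline(gym.make("CartPole-v1"))` supplies the comparison number for the README.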
Extension ideas
- Tile coding instead of polynomials (coarse bins per dimension).
- Replace linear model with a 1-hidden-layer net (preview Module 4).
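For the first extension, a rough starting point is per-dimension coarse binning with a few offset tilings. This is a simplification: full tile coding tiles the joint state space, whereas the sketch bins each dimension independently, as the bullet suggests. The function name `tile_features` and the velocity ranges in `LOW` and `HIGH` are assumptions to tune; the position and angle limits follow CartPole's termination bounds (about ±2.4 and ±12°).

```python
import numpy as np

# Assumed clipping ranges: [cart_pos, cart_vel, pole_angle, pole_vel].
# The velocity limits are guesses, since CartPole velocities are unbounded.
LOW = np.array([-2.4, -3.0, -0.21, -3.5])
HIGH = np.array([2.4, 3.0, 0.21, 3.5])


def tile_features(state, n_bins=8, n_tilings=4, low=LOW, high=HIGH):
    """Binary features: one active bin per dimension per tiling."""
    state = np.clip(np.asarray(state, dtype=float), low, high)
    n_dims = state.size
    width = (high - low) / n_bins
    # Offsets push the top edge into one extra bin, hence n_bins + 1.
    feats = np.zeros((n_tilings, n_dims, n_bins + 1))
    for t in range(n_tilings):
        offset = width * t / n_tilings          # shift each tiling by a fraction of a bin
        idx = np.floor((state - low + offset) / width).astype(int)
        idx = np.clip(idx, 0, n_bins)           # guard against rounding at the edges
        feats[t, np.arange(n_dims), idx] = 1.0
    return feats.reshape(-1)
```

With the defaults this yields 4 × 4 × 9 = 144 features, of which exactly 16 are active for any state, so the agent would be built with `n_features=144`. A common convention with binary features is to divide the step size by the number of active features.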
What's next
Return to the course curriculum and continue to the next module when your project runs end-to-end.