Project: CartPole with linear FA

Before we begin

Solve CartPole-v1 using linear function approximation — hand-crafted features and a linear model for action values or state values. This bridges tabular Q-learning (Module 2) and neural DQN (Module 4) without jumping straight to deep nets.

How this connects to Module 3

Lesson	Where you use it
Why tabular breaks	CartPole has continuous-ish observations
Linear FA	Q̂(s,a) = wᵀ φ(s,a) or V̂(s) = wᵀ φ(s)
Neural nets as approximators	Preview: same structure, nonlinear φ
Deadly triad	Why you still use target-style updates carefully

What you will build

Piece	Purpose
Feature map φ(s)	Polynomial / tile features over (x, ẋ, θ, θ̇)
Linear Q or SARSA(λ) with FA	Weight vector w updated each step
Learning curve	Episode return vs episode

Folder layout:

text

cartpole-linear-fa/
  features.py        # φ(s) and φ(s,a)
  linear_agent.py
  train.py
  outputs/returns.png

Estimated time: 3–5 hours.

Before you start

Finish the Module 3 quiz.
pip install gymnasium numpy matplotlib

Step 1 — Features for CartPole

CartPole observation: [cart_pos, cart_vel, pole_angle, pole_vel]. Build expanded features:

python

import numpy as np
 
def phi(state: np.ndarray) -> np.ndarray:
    x, x_dot, theta, theta_dot = state
    return np.array([
        1.0, x, x_dot, theta, theta_dot,
        x**2, theta**2, x * theta,
        np.sin(theta), np.cos(theta),
    ])

For semi-gradient Q-learning with 2 actions, use phi(s, a): concatenate φ(s) with a one-hot action block, or maintain separate weight vectors per action.

Step 2 — Linear Q-learning update

python

class LinearQLearning:
    def __init__(self, n_features, n_actions, alpha=1e-4, gamma=0.99, epsilon=0.1):
        self.w = np.zeros((n_actions, n_features))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.n_actions = n_actions
 
    def q_values(self, state):
        f = phi(state)
        return self.w @ f
 
    def act(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.q_values(state)))
 
    def learn(self, s, a, r, s_next, done):
        f = phi(s)
        q_sa = self.w[a] @ f
        q_next = 0.0 if done else np.max(self.q_values(s_next))
        td_error = r + self.gamma * q_next - q_sa
        self.w[a] += self.alpha * td_error * f

Tune alpha — linear FA is sensitive (try 1e-5 … 1e-3).

Step 3 — Training

python

import gymnasium as gym
 
env = gym.make("CartPole-v1")
agent = LinearQLearning(n_features=10, n_actions=2)
returns = []
 
for ep in range(1000):
    s, _ = env.reset(seed=ep)
    total = 0.0
    done = False
    while not done:
        a = agent.act(s)
        s_next, r, term, trunc, _ = env.step(a)
        done = term or trunc
        agent.learn(s, a, r, s_next, done)
        s = s_next
        total += r
    returns.append(total)
    agent.epsilon = max(0.01, agent.epsilon * 0.995)

Success criteria

Criterion	Target
Mean return over last 100 episodes ≥ 195	CartPole "solved" threshold
README notes which features mattered most	Required
Compare random baseline vs trained	Required

Extension ideas

Tile coding instead of polynomials (coarse bins per dimension).
Replace linear model with a 1-hidden-layer net (preview Module 4).

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.