Project: REINFORCE on CartPole
Before we begin
Implement REINFORCE — Monte Carlo policy gradients — on CartPole-v1. You optimize the policy directly (no Q-table), using full-episode returns and the log-probability trick from Module 5.
How this connects to Module 5
| Lesson | Where you use it |
|---|---|
| Why learn policies | Softmax policy over discrete actions |
| REINFORCE | ∇J ≈ Σₜ ∇ log π(aₜ \| sₜ) · Gₜ |
| Baseline | Subtract mean return to cut variance (stretch) |
| Actor–critic | Preview: replace Gₜ with advantage |
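For reference, the REINFORCE row can be written out in full. This is a sketch of the usual single-episode Monte Carlo form; the symbols θ (policy parameters), T (episode length), and rₖ (reward received after action aₖ) are notation assumed here, chosen to match the discounting used in Step 2.

```latex
\nabla_\theta J(\theta) \;\approx\; \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t,
\qquad
G_t \;=\; \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k \;=\; r_t + \gamma\, G_{t+1}

% Loss whose gradient is the negative of the estimator above:
L(\theta) \;=\; -\sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)\, G_t
```

Minimizing L(θ) by gradient descent therefore moves θ along the estimated ascent direction of J, with Gₜ treated as a constant when differentiating.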
What you will build
| Piece | Purpose |
|---|---|
| PolicyNetwork | MLP → action logits → Categorical distribution |
| Episode buffer | Store log_probs and rewards per step |
| REINFORCE update | Loss = −Σₜ log π(aₜ \| sₜ) · Gₜ |
Estimated time: 4–5 hours.
Before you start
- Finish the Module 5 quiz.
- Install the dependencies: pip install gymnasium torch numpy matplotlib
Step 1 — Policy network
python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class PolicyNetwork(nn.Module):
    """Small MLP that maps an observation to a distribution over actions."""

    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        # Categorical applies the softmax to the raw logits internally.
        return Categorical(logits=self.net(x))

Step 2 — One episode → policy gradient
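The episode function below computes discounted returns with a backward recursion, Gₜ = rₜ + γ·Gₜ₊₁. That recursion can be checked in isolation before any environment is involved. The following is a minimal sketch: `discounted_returns` is a hypothetical helper (not part of the project code), and the three-step reward sequence and γ = 0.5 are made up to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, ..., G_{T-1}] where G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    out = []
    # Walk the episode backwards so each G_t reuses G_{t+1}.
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    out.reverse()
    return out


# Made-up example: CartPole pays +1 per step, so three steps give [1, 1, 1].
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Earlier steps get larger returns because more reward lies ahead of them, which is why early actions in a long episode receive the strongest credit.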
python
import numpy as np
import torch


def reinforce_episode(env, policy, optimizer, gamma=0.99):
    """Run one episode, take one REINFORCE step, return the undiscounted episode return."""
    log_probs, rewards = [], []
    s, _ = env.reset()
    done = False
    while not done:
        s_t = torch.tensor(s, dtype=torch.float32)
        dist = policy(s_t)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, term, trunc, _ = env.step(int(a.item()))
        rewards.append(r)
        done = term or trunc

    # Discounted returns G_t = r_t + gamma * G_{t+1}, computed backwards.
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)
    # Standardize returns within the episode to reduce gradient variance.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Loss = -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)

Step 3 — Training loop
Train for 500–1000 episodes. REINFORCE is high-variance — learning curves will be noisy. Plot raw returns and a rolling mean (window 50).
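One possible shape for this step is sketched below. It is an illustration under stated assumptions, not a reference solution: `train`, `rolling_mean`, and `plot_returns` are hypothetical helper names, `train` takes any zero-argument callable that runs one episode and returns its total reward, and the output file name and the learning rate in the commented wiring are arbitrary starting points.

```python
import numpy as np


def rolling_mean(values, window=50):
    """Mean over each full window of `window` consecutive values."""
    values = np.asarray(values, dtype=np.float64)
    if len(values) < window:
        return np.array([])  # not enough episodes for a single full window
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def train(run_episode, n_episodes=500):
    """Call `run_episode()` n_episodes times and collect the episode returns."""
    return [run_episode() for _ in range(n_episodes)]


def plot_returns(returns, window=50, path="returns.png"):
    """Plot raw returns with the rolling mean overlaid, then save the figure."""
    import matplotlib.pyplot as plt  # lazy import: the helpers above do not need it

    smooth = rolling_mean(returns, window)
    plt.figure()
    plt.plot(returns, alpha=0.3, label="episode return")
    # The k-th rolling value covers episodes k .. k+window-1; align it to the window's end.
    x = np.arange(window - 1, window - 1 + len(smooth))
    plt.plot(x, smooth, label=f"rolling mean ({window})")
    plt.xlabel("episode")
    plt.ylabel("return")
    plt.legend()
    plt.savefig(path)
    plt.close()


# Hypothetical wiring for this project (the learning rate is an assumed starting point):
#   env = gymnasium.make("CartPole-v1")
#   policy = PolicyNetwork()
#   optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
#   returns = train(lambda: reinforce_episode(env, policy, optimizer), n_episodes=500)
#   plot_returns(returns)
```

Keeping the loop independent of the environment makes it easy to reuse the same plotting code when comparing runs with and without return normalization.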
Success criteria
| Criterion | Target |
|---|---|
| Policy gradient implemented (no value network cheating) | Required |
| Mean return over the last 100 episodes ≥ 195 | Typical with normalization |
| README compares with vs without return normalization | Recommended |
Extension ideas
- Learned value baseline (actor–critic, one step toward Module 6).
- Entropy bonus for exploration.
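As one way to approach the entropy-bonus idea, the loss can subtract a small multiple of the policy's entropy at each visited state, so that minimizing the loss also nudges the policy to keep some spread over actions. The sketch below is illustrative only: `reinforce_loss_with_entropy` and the coefficient `beta=0.01` are hypothetical choices, and the logits, actions, and returns in the check are made up.

```python
import torch
from torch.distributions import Categorical


def reinforce_loss_with_entropy(log_probs, entropies, returns, beta=0.01):
    """REINFORCE loss minus an entropy bonus (hypothetical helper).

    log_probs, entropies: lists of 0-dim tensors, one entry per step.
    returns: 1-D tensor of (optionally normalized) returns G_t.
    """
    pg_loss = -(torch.stack(log_probs) * returns).sum()
    entropy = torch.stack(entropies).sum()
    # Subtracting beta * entropy means that minimizing the loss pushes entropy up.
    return pg_loss - beta * entropy


# Tiny check on a made-up two-step episode with a uniform two-action policy.
logits = torch.zeros(2, 2, requires_grad=True)
dists = [Categorical(logits=logits[t]) for t in range(2)]
actions = [torch.tensor(0), torch.tensor(1)]
log_probs = [d.log_prob(a) for d, a in zip(dists, actions)]
entropies = [d.entropy() for d in dists]
returns = torch.tensor([1.0, -1.0])

loss = reinforce_loss_with_entropy(log_probs, entropies, returns, beta=0.01)
loss.backward()  # gradients flow through both the log-probabilities and the entropy
```

In `reinforce_episode`, this would mean appending `dist.entropy()` to a list next to `dist.log_prob(a)` at each step and passing both lists to the loss.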
What's next
Return to the course curriculum and continue to the next module when your project runs end-to-end.