Project: REINFORCE on CartPole

Before we begin

Implement REINFORCE — Monte Carlo policy gradients — on CartPole-v1. You optimize the policy directly (no Q-table), using full-episode returns and the log-probability trick from Module 5.

How this connects to Module 5

Lesson	Where you use it
Why learn policies	Softmax policy over discrete actions
REINFORCE	∇J ≈ Σₜ ∇ log π(aₜ | sₜ) Gₜ
Baseline	Subtract mean return to cut variance (stretch)
Actor–critic	Preview: replace Gₜ with advantage
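
For reference, the log-probability trick in the REINFORCE row can be written out as below. This is a sketch of the standard derivation rather than a quote from Module 5: the symbols τ (a full episode), R(τ) (its total reward), and T (episode length) are notation assumed here, and the last line uses the common practical form with discounted reward-to-go from a single sampled episode.

```latex
\begin{align}
\nabla_\theta J(\theta)
  &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
   = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \big] \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big] \\
  &\approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t,
  \qquad G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k .
\end{align}
```

The second line holds because the environment dynamics do not depend on θ, so only the policy terms survive in the gradient of log p_θ(τ).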

What you will build

Piece	Purpose
`PolicyNetwork`	MLP → action logits → Categorical
Episode buffer	Store log_probs and rewards per step
REINFORCE update	Loss = −Σₜ log π(aₜ | sₜ) Gₜ

Estimated time: 4–5 hours.

Before you start

Finish the Module 5 quiz.
Install the dependencies: pip install gymnasium torch numpy matplotlib

Step 1 — Policy network

python

import torch
import torch.nn as nn
from torch.distributions import Categorical
 
class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
 
    def forward(self, x):
        return Categorical(logits=self.net(x))
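
To see what the returned Categorical gives you, here is a minimal sketch that skips the network and uses a hand-written logits tensor as a stand-in for the output of self.net(x). The tensor values and the seed are assumptions chosen for illustration.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)  # arbitrary seed, only so repeated runs sample the same action

# Stand-in for the network output: equal logits for the two CartPole actions.
logits = torch.tensor([0.0, 0.0], requires_grad=True)
dist = Categorical(logits=logits)

a = dist.sample()            # sampled action index (0 or 1); sampling itself is not differentiated
log_prob = dist.log_prob(a)  # log pi(a | s), differentiable with respect to the logits

# Backpropagating through log_prob gives one_hot(a) - probs,
# here +0.5 for the sampled action and -0.5 for the other one.
log_prob.backward()
print("action:", int(a.item()))
print("probs:", dist.probs.detach().tolist())
print("grad of log_prob w.r.t. logits:", logits.grad.tolist())
```

This is the quantity that Step 2 stores per step: the sampled action goes to the environment, and its log-probability is kept for the update.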

Step 2 — One episode → policy gradient

python

import numpy as np
 
def reinforce_episode(env, policy, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    s, _ = env.reset()
    done = False
    while not done:
        s_t = torch.tensor(s, dtype=torch.float32)
        dist = policy(s_t)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, term, trunc, _ = env.step(int(a.item()))
        rewards.append(r)
        done = term or trunc
 
    # Discounted returns G_t, accumulated backwards from the final step
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)
    # Normalize returns; subtracting the mean acts as a simple constant baseline
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # REINFORCE loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
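
The return computation is easy to check by hand on a tiny episode. The sketch below is plain Python, independent of the training code; the helper names discounted_returns and normalize are hypothetical, and gamma=0.5 is chosen only to keep the arithmetic readable.

```python
import statistics

def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, ..., G_{T-1}] where G_t = r_t + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    out.reverse()
    return out

def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance scaling using the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # needs at least two values
    return [(v - m) / (s + eps) for v in values]

# Three steps with reward 1 each, as in CartPole:
# G_2 = 1, G_1 = 1 + 0.5 * 1 = 1.5, G_0 = 1 + 0.5 * 1.5 = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
print(normalize(discounted_returns([1.0, 1.0, 1.0], gamma=0.5)))
```

After normalization, early steps (larger G_t) get positive weights and late steps get negative ones, which is why the normalized update behaves differently from the raw one.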

Step 3 — Training loop

Train for 500–1000 episodes. REINFORCE is high-variance — learning curves will be noisy. Plot raw returns and a rolling mean (window 50).
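
One possible shape for the loop is sketched below. To keep the block runnable on its own, train takes a zero-argument callable that runs one episode and returns its total reward; train, rolling_mean, and log_every are hypothetical names, and the learning rate in the wiring comments is an assumption, not a tuned value.

```python
from collections import deque

def rolling_mean(values, window=50):
    """Mean of the most recent `window` values at each index (fewer at the start)."""
    out, buf, total = [], deque(), 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()
        out.append(total / len(buf))
    return out

def train(run_episode, n_episodes=500, log_every=50):
    """Call run_episode() n_episodes times and collect the episode returns."""
    returns = []
    for ep in range(1, n_episodes + 1):
        returns.append(float(run_episode()))
        if ep % log_every == 0:
            recent = returns[-log_every:]
            print(f"episode {ep}: mean return over last {len(recent)} = "
                  f"{sum(recent) / len(recent):.1f}")
    return returns

# Wiring with Steps 1 and 2 (assumed, not executed here):
#   import gymnasium as gym, torch, matplotlib.pyplot as plt
#   env = gym.make("CartPole-v1")
#   policy = PolicyNetwork()
#   optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)  # lr is an assumption
#   returns = train(lambda: reinforce_episode(env, policy, optimizer))
#   plt.plot(returns, alpha=0.4, label="raw")
#   plt.plot(rolling_mean(returns, window=50), label="rolling mean (50)")
#   plt.legend(); plt.show()
```

Passing the episode function in as a callable makes it simple to run the same loop twice, once with and once without return normalization, for the README comparison.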

Success criteria

Criterion	Target
Policy gradient implemented (no value network cheating)	Required
Mean return over the last 100 episodes ≥ 195	Typical with normalization
README compares with vs without return normalization	Recommended

Extension ideas

Learned value baseline (actor–critic, one step toward Module 6).
Entropy bonus for exploration.
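
For the entropy bonus, one common formulation subtracts a small multiple of the policy entropy from the loss, so that minimizing the loss also favors less deterministic policies. The sketch below assumes you keep per-step logits for the episode; the function name reinforce_loss_with_entropy and the coefficient beta=0.01 are hypothetical choices, not values from this project.

```python
import torch
from torch.distributions import Categorical

def reinforce_loss_with_entropy(logits, actions, returns, beta=0.01):
    """REINFORCE loss minus beta times the summed policy entropy.

    logits:  (T, n_actions) policy outputs for each step of one episode
    actions: (T,) sampled action indices (long tensor)
    returns: (T,) returns G_t, normalized or not
    """
    dist = Categorical(logits=logits)
    pg_loss = -(dist.log_prob(actions) * returns).sum()
    entropy = dist.entropy().sum()
    return pg_loss - beta * entropy

# Tiny check with a uniform two-action policy over three steps.
logits = torch.zeros(3, 2, requires_grad=True)
actions = torch.zeros(3, dtype=torch.long)
returns = torch.ones(3)
loss = reinforce_loss_with_entropy(logits, actions, returns)
loss.backward()
print(float(loss))
```

With beta set to 0 this reduces to the plain REINFORCE loss from Step 2, which makes it easy to compare the two variants.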

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.