Module 5 — Policy gradients

Project: REINFORCE on CartPole

Implement REINFORCE with baseline; plot return and policy entropy.

~130 min read + exercises

Before we begin

Implement REINFORCE, the Monte Carlo policy-gradient algorithm, on CartPole-v1. You optimize the policy directly (no Q-table), weighting the log-probability of each sampled action by the full-episode return that followed it, as covered in the Module 5 lessons.


How this connects to Module 5

| Lesson | Where you use it |
| --- | --- |
| Why learn policies | Softmax policy over discrete actions |
| REINFORCE | ∇J ≈ Σₜ ∇ log π(aₜ ∣ sₜ) Gₜ |
| Baseline | Subtract mean return to cut variance (stretch) |
| Actor–critic | Preview: replace Gₜ with advantage |
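
For reference, the estimator named in the table can be written out in full. This is a sketch of the standard form, not a derivation taken from the lesson text; the symbols θ (policy parameters), T (episode length), rₖ (reward received after action aₖ), and b (a baseline that does not depend on the action) are notation assumed here.

```latex
% Assumed notation: \theta = policy parameters, T = episode length,
% r_k = reward received after action a_k, b = action-independent baseline.
\begin{aligned}
\nabla_\theta J(\theta) &\approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t,
\qquad G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k \\[4pt]
\text{with a baseline:}\quad
\nabla_\theta J(\theta) &\approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, (G_t - b) \\[4pt]
\text{no bias is added because}\quad
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\big[\, b\, \nabla_\theta \log \pi_\theta(a \mid s) \big]
&= b\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b\, \nabla_\theta 1 = 0
\end{aligned}
```

The loss used in this project is the negative of the first sum, so a gradient-descent step on the loss is a gradient-ascent step on J. The actor–critic preview corresponds to replacing the constant b with a state-dependent estimate, which turns Gₜ − b into an advantage.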

What you will build

| Piece | Purpose |
| --- | --- |
| PolicyNetwork | MLP → action logits → Categorical |
| Episode buffer | Store log_probs and rewards per step |
| REINFORCE update | Loss = −Σₜ log π(aₜ ∣ sₜ) Gₜ |

Estimated time: 4–5 hours.


Before you start

  • Finish the Module 5 quiz.
  • pip install gymnasium torch numpy matplotlib

Step 1 — Policy network

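```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class PolicyNetwork(nn.Module):
    """Small MLP that maps a CartPole observation to a distribution over actions."""

    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # raw logits; Categorical applies the softmax
        )

    def forward(self, x):
        # Return a distribution so callers can sample, take log_prob, and read entropy.
        return Categorical(logits=self.net(x))
```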
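
To make the "logits → Categorical" step concrete, here is a small pure-Python sketch of the quantities a categorical distribution built from logits provides: action probabilities (softmax), the log-probability of one action, and the entropy. The helper names `softmax`, `log_prob`, and `entropy` are illustrative and written from scratch; they mirror the math rather than PyTorch's internals.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob(logits, action):
    """Log-probability of choosing `action` under softmax(logits)."""
    return math.log(softmax(logits)[action])


def entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


uniform = [0.0, 0.0]  # equal logits: both CartPole actions equally likely
peaked = [4.0, 0.0]   # strongly prefers action 0

print(softmax(uniform))      # [0.5, 0.5]
print(entropy(uniform))      # ln 2, about 0.693
print(entropy(peaked))       # much smaller: the policy is nearly deterministic
print(log_prob(uniform, 1))  # -ln 2, about -0.693
```

With two actions the entropy starts near ln 2 for an untrained policy and falls as the policy commits to one action per state, which is why it is a useful curve to plot next to the return.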

Step 2 — One episode → policy gradient

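```python
import torch


def reinforce_episode(env, policy, optimizer, gamma=0.99):
    """Run one episode, apply one REINFORCE update, return the undiscounted episode return."""
    log_probs, rewards = [], []
    s, _ = env.reset()
    done = False
    while not done:
        s_t = torch.tensor(s, dtype=torch.float32)
        dist = policy(s_t)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))  # keeps the graph for backprop
        s, r, term, trunc, _ = env.step(int(a.item()))
        rewards.append(float(r))
        done = term or trunc

    # Discounted returns G_t, accumulated backwards from the final step.
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)
    # Per-episode normalization (no learned baseline): subtracting the mean acts
    # like a crude constant baseline, dividing by the std stabilizes the step size.
    if len(returns) > 1:  # std of a single element is NaN
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Minimizing -sum_t log pi(a_t | s_t) * G_t ascends the expected return.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```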
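
The backward recursion for G_t and the normalization step are easy to check by hand on a tiny episode. The sketch below repeats both in plain Python; `discounted_returns` and `normalize` are illustrative helper names, and `normalize` uses the sample standard deviation (n − 1 in the denominator) on the assumption that it should match the default of `torch.Tensor.std`.

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards from the last step."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    out.reverse()
    return out


def normalize(values, eps=1e-8):
    """Zero-mean, unit-std rescaling; needs at least two values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    std = var ** 0.5
    return [(v - mean) / (std + eps) for v in values]


# Three steps with reward 1 each, as in CartPole, and gamma = 0.5 for easy arithmetic.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)             # [1.75, 1.5, 1.0]
print(normalize(returns))  # first entry positive, last entry negative
```

Because every CartPole step pays +1, G_t shrinks toward the end of the episode, so after normalization the early actions get positive weights and the last actions before failure get negative ones.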

Step 3 — Training loop

Train for 500–1000 episodes. REINFORCE is high-variance — learning curves will be noisy. Plot raw returns and a rolling mean (window 50).
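
One possible shape for this loop is sketched below. The helper names `rolling_mean`, `train`, and `plot_returns` are hypothetical, and `train` takes a zero-argument callable so the loop itself does not depend on any particular environment; the wiring to `reinforce_episode`, the Adam optimizer, and the learning rate of 1e-2 shown in the trailing comment are assumptions, not values prescribed by the course.

```python
def rolling_mean(values, window=50):
    """Trailing mean over up to `window` most recent values; same length as the input."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out


def train(run_episode, n_episodes=500, log_every=50):
    """Call `run_episode()` repeatedly and collect the episode returns."""
    returns = []
    for ep in range(1, n_episodes + 1):
        returns.append(run_episode())
        if ep % log_every == 0:
            recent = returns[-log_every:]
            print(f"episode {ep}: mean of last {len(recent)} = {sum(recent) / len(recent):.1f}")
    return returns


def plot_returns(returns, window=50, path="returns.png"):
    """Plot raw returns and their rolling mean, then save the figure to `path`."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    plt.figure()
    plt.plot(returns, alpha=0.3, label="raw return")
    plt.plot(rolling_mean(returns, window), label=f"rolling mean ({window})")
    plt.xlabel("episode")
    plt.ylabel("return")
    plt.legend()
    plt.savefig(path)


# Assumed wiring with the functions from Steps 1 and 2:
#   import gymnasium as gym, torch
#   env = gym.make("CartPole-v1")
#   policy = PolicyNetwork()
#   optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
#   returns = train(lambda: reinforce_episode(env, policy, optimizer))
#   plot_returns(returns)
```

Keeping the episode function behind a callable also makes the with/without normalization comparison simple: run `train` twice with two different episode functions and plot both curves.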


Success criteria

| Criterion | Target |
| --- | --- |
| Policy gradient implemented (no value network cheating) | Required |
| Mean return over the last 100 episodes ≥ 195 | Typical with normalization |
| README compares with vs without return normalization | Recommended |
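
The return criterion can be checked mechanically at the end of a run. A minimal sketch, assuming `returns` is a list of per-episode returns; the function name `meets_target` is hypothetical.

```python
def meets_target(returns, window=100, target=195.0):
    """True if the mean of the last `window` episode returns is at least `target`."""
    if len(returns) < window:
        return False  # not enough episodes yet to judge
    recent = returns[-window:]
    return sum(recent) / window >= target


print(meets_target([200.0] * 100))                 # True
print(meets_target([500.0] * 50 + [100.0] * 100))  # False: only the last 100 count
```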

Extension ideas

  • Learned value baseline (actor–critic, one step toward Module 6).
  • Entropy bonus for exploration.
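
As a starting point for the entropy-bonus extension, the loss can be extended with a term that rewards high-entropy action distributions. The sketch below is one common formulation, not the course's reference solution: the function name `reinforce_loss_with_entropy` and the coefficient `beta=0.01` are assumptions, and it expects the episode loop to also collect `dist.entropy()` at each step.

```python
import torch
from torch.distributions import Categorical


def reinforce_loss_with_entropy(log_probs, entropies, returns, beta=0.01):
    """Policy-gradient loss minus beta times the summed per-step policy entropy."""
    log_probs = torch.stack(log_probs)
    entropies = torch.stack(entropies)
    pg_loss = -(log_probs * returns).sum()
    # Subtracting the entropy means minimizing the loss pushes entropy up.
    return pg_loss - beta * entropies.sum()


# One-step check with a uniform two-action policy.
logits = torch.zeros(2, requires_grad=True)
dist = Categorical(logits=logits)
action = torch.tensor(0)
loss = reinforce_loss_with_entropy(
    [dist.log_prob(action)], [dist.entropy()], torch.tensor([1.0]), beta=0.01
)
loss.backward()
print(loss.item())  # 0.99 * ln 2, about 0.686
print(logits.grad)  # descent on this gradient raises the logit of the sampled action
```

Averaging the collected entropies per episode also gives the policy-entropy curve mentioned in the project summary.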

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.