REINFORCE & the policy gradient theorem
Before we begin
REINFORCE (Monte Carlo policy gradient) updates θ using full-episode returns weighted by log-probabilities of actions taken. The policy gradient theorem justifies replacing a brutal trajectory derivative with an elegant expectation: ∇J ∝ E[∇ log π(a|s) · G_t].
Learning objectives
- Derive the REINFORCE update at an intuitive level (no measure theory required).
- Implement episodic REINFORCE with return-to-go G_t.
- Explain why log π turns policy optimization into a weighted score function estimator.
- Connect discount γ to return-to-go weighting.
- Recognize high variance as the main practical weakness.
Policy gradient theorem (intuition)
Goal: maximize J(θ) = E_π[G_0]. The trick:
∇_θ J(θ) = E_π [ sum_t ∇_θ log π(a_t|s_t; θ) · G_t ]- ∇ log π points in parameter space that increase probability of action a_t in state s_t.
- G_t is return from time t — if the episode went well, amplify actions that led there; if poorly, suppress them.
- Expectation estimated by averaging over episodes (Monte Carlo).
REINFORCE algorithm
for each episode:
generate τ = (s_0,a_0,r_1, ..., s_T) using π_θ
for each t:
G_t = sum_{k=t}^{T} γ^{k-t} r_{k+1}
for each t:
θ ← θ + α · γ^t · ∇_θ log π(a_t|s_t; θ) · G_tOften use return-to-go from t to end without extra γ^t on the gradient (implementations vary — be consistent within one codebase).
PyTorch REINFORCE (one episode)
import torch
def reinforce_episode(policy, optimizer, episode, gamma=0.99):
log_probs = []
rewards = []
for obs, action, reward in episode:
obs_t = torch.tensor(obs, dtype=torch.float32)
dist = policy(obs_t)
a = torch.tensor(action)
log_probs.append(dist.log_prob(a))
rewards.append(reward)
# returns-to-go
G = 0.0
returns = []
for r in reversed(rewards):
G = r + gamma * G
returns.insert(0, G)
returns = torch.tensor(returns, dtype=torch.float32)
# optional: normalize returns for stability
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
loss = []
for log_p, Gt in zip(log_probs, returns):
loss.append(-log_p * Gt) # minimize negative objective
loss = torch.stack(loss).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
return loss.item()Worked example — two-step episode
Actions a_0, a_1 with log π = −0.5, −0.3. Rewards r_1=0, r_2=+1. γ=1.
| t | G_t | ∇θ log π | Update direction |
|---|---|---|---|
| 0 | 1.0 | increases a_0 prob | positive (good episode) |
| 1 | 1.0 | increases a_1 prob | positive |
If final reward were −1, same gradients flip sign — actions made less likely.
Log-derivative trick (why log?)
∇_θ π(a|s; θ) = π(a|s; θ) ∇_θ log π(a|s; θ)Rewriting the gradient as an expectation of score function ∇ log π times reward avoids differentiating through the environment transition dynamics — only the policy needs to be differentiable.
Variance and sample complexity
REINFORCE uses full episode returns — early actions get credit for late randomness. Variance grows with horizon. Typical CartPole needs thousands of episodes; same env with actor–critic (Module 5–6) learns faster.
| Property | REINFORCE | Q-learning |
|---|---|---|
| Bootstrapping | No (MC) | Yes (TD) |
| Variance | High | Lower (with replay) |
| On-policy | Yes | Off-policy |
| Step complexity | Simple | Replay + targets |
Checkpoint — details: REINFORCE is the cleanest proof of concept for policy gradients — if it never learns CartPole, check return normalization and learning rate before adding critics. Summary: ∇J uses log π weighted by how good the rest of the episode was.
Common mistakes
- Backprop through environment — only policy parameters get gradients.
- Using total episode return for every t — should be G_t from t onward, not G_0 for all t.
- No return normalization — gradients explode when returns scale varies.
- Learning rate too high — policy collapse to near-deterministic bad actions.
- Off-policy data with REINFORCE — standard REINFORCE assumes on-policy samples.