Baseline & variance reduction
Before we begin
REINFORCE's gradient is correct in expectation but noisy. Subtracting a baseline b(s) that does not depend on the action cuts variance without introducing bias. The most common baseline is a learned value function V(s; w) — the bridge to actor–critic methods.
Learning objectives
- Show that subtracting action-independent baselines leaves the gradient unbiased.
- Use reward-to-go minus V(s_t) as the advantage signal.
- Implement a constant baseline (batch mean return) as a quick win.
- Interpret advantage A_t > 0 as "action better than expected."
- Prepare for critic networks in the next lesson.
Baseline math (intuition)
Replace G_t with (G_t − b(s_t)):
E_{a∼π} [ ∇ log π(a|s) · b(s) ] = 0 when b(s) does not depend on a.
Proof sketch: b(s) factors out of the expectation, and sum_a π(a|s) ∇ log π(a|s) = sum_a ∇ π(a|s) = ∇ sum_a π(a|s) = ∇ 1 = 0. So baselines remove noise, not signal.
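As a quick sanity check of the proof sketch, the snippet below enumerates the actions of a small softmax policy and sums π(a|s) · ∇ log π(a|s) · b exactly. The logits, the baseline value 7.0, and the helper names are illustrative assumptions, not part of the lesson's code.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(probs, action):
    # Gradient of log pi(action) w.r.t. each logit k: 1[k == action] - pi(k).
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def expected_baseline_term(logits, b):
    # Exact expectation over actions of grad log pi(a) * b, one entry per logit.
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for a, p in enumerate(probs):
        g = grad_log_softmax(probs, a)
        for k in range(len(logits)):
            total[k] += p * g[k] * b
    return total

term = expected_baseline_term([0.5, -1.0, 2.0], b=7.0)
print(term)  # every entry is zero up to floating-point rounding
```

Changing b to any other action-independent number leaves the result at zero, which is the unbiasedness claim in miniature.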
Constant baseline — quick experiment
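    def reinforce_with_baseline(log_probs, returns):
        # returns: 1-D tensor of returns; log_probs: matching log pi(a_t|s_t) values
        baseline = returns.mean()  # constant, action-independent baseline
        advantages = returns - baseline
        loss = sum(-lp * adv for lp, adv in zip(log_probs, advantages))
        return loss

Subtracting the batch mean return is a one-line change that often speeds up CartPole learning by roughly 2–3×.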
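To see why a constant baseline helps, consider a hypothetical two-armed bandit with a uniform policy and deterministic rewards of 100 and 80 (numbers chosen only for illustration). The sketch below computes the exact mean and variance of the single-sample gradient estimate for the first logit, with and without a mean-reward baseline. All names are assumptions made for this example.

```python
def grad_estimates(probs, rewards, baseline):
    # One gradient sample per action, w.r.t. logit 0 of a softmax policy:
    # d log pi(a) / d logit_0 = 1[a == 0] - pi(0)
    return [((1.0 if a == 0 else 0.0) - probs[0]) * (r - baseline)
            for a, r in enumerate(rewards)]

def mean_and_variance(probs, samples):
    # Exact mean and variance of the estimator under the policy's action distribution.
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

probs = [0.5, 0.5]
rewards = [100.0, 80.0]
b = sum(p * r for p, r in zip(probs, rewards))  # mean reward, 90.0

mean_raw, var_raw = mean_and_variance(probs, grad_estimates(probs, rewards, 0.0))
mean_base, var_base = mean_and_variance(probs, grad_estimates(probs, rewards, b))

print(mean_raw, var_raw)    # 5.0 2025.0
print(mean_base, var_base)  # 5.0 0.0
```

The expected gradient is identical in both cases; only the spread changes. The variance drops all the way to zero here because the toy problem has just two equally likely actions. In realistic settings a baseline reduces variance without eliminating it.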
State-dependent baseline V(s)
Learn V(s; w) with regression to returns or TD targets. Use advantage:
A_t = G_t − V(s_t)
Policy update: ∇ log π(a_t|s_t) · A_t. Critic update: minimize (V(s_t) − G_t)² or the squared TD error.
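The advantage formula above can be sketched in plain Python as follows. The discount factor gamma and the function names are assumptions introduced for illustration; the value estimates would normally come from the learned critic, but are hard-coded here.

```python
def rewards_to_go(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, accumulated backwards from the episode end.
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def advantages(rewards, values, gamma=0.99):
    # A_t = G_t - V(s_t), one entry per timestep.
    return [g - v for g, v in zip(rewards_to_go(rewards, gamma), values)]

# Three steps of reward 1 with gamma = 0.5 and a critic that predicts 1.0 everywhere.
adv = advantages([1.0, 1.0, 1.0], values=[1.0, 1.0, 1.0], gamma=0.5)
print(adv)  # [0.75, 0.5, 0.0]
```

Early actions receive positive advantages here because the hypothetical critic underestimates how much reward is still to come from those states.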
Worked example — numeric baseline
Three episodes end with returns 80, 100, and 60. The constant baseline is their mean, b = 80.
| Episode | G_0 | G_0 − b | Effect on gradient |
|---|---|---|---|
| 1 | 80 | 0 | no update |
| 2 | 100 | +20 | strengthen trajectory |
| 3 | 60 | −20 | weaken trajectory |
Only the return relative to the baseline drives the update, not its absolute scale, which is why baselines pair well with return normalization.
PyTorch: actor step with learned baseline
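    import torch.nn as nn

    class ValueNet(nn.Module):
        """State-value critic V(s; w) used as the baseline."""
        def __init__(self, obs_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, obs):
            # Drop the trailing size-1 dimension so values match the shape of returns.
            return self.net(obs).squeeze(-1)

    def actor_loss(log_probs, values, returns):
        # Detach so the policy gradient does not flow into the critic.
        advantages = returns - values.detach()
        return -(log_probs * advantages).sum()

    def critic_loss(values, returns):
        # Regress V(s_t) onto the Monte Carlo returns.
        return ((values - returns) ** 2).mean()

Note the .detach() on values in the actor loss: the critic learns separately, and the actor treats V as a fixed baseline during the policy step.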
Other variance reduction tools (preview)
| Technique | What it does |
|---|---|
| Baseline V(s) | Center advantages per state |
| Reward normalization | Scale returns batch-wise |
| Entropy bonus | Encourage exploration, prevent collapse |
| GAE (Module 6) | Bias–variance tradeoff for multi-step advantage |
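As one concrete instance of the batch-wise scaling listed in the table, the sketch below standardizes a batch of advantages to zero mean and unit standard deviation. The function name and the small eps guard against division by zero are assumptions for illustration, not a prescribed recipe.

```python
import statistics

def normalize_advantages(advantages, eps=1e-8):
    # Center on the batch mean and divide by the population standard deviation.
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]

# Advantages from the worked example: returns 80, 100, 60 with baseline 80.
normalized = normalize_advantages([0.0, 20.0, -20.0])
print(normalized)
```

After normalization the update size no longer depends on the reward scale of the environment, although the ordering of the advantages is unchanged.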
Checkpoint: If REINFORCE learns but the training curves look like a seismograph, add a baseline before touching network depth.
Summary: Subtract what you expected from what you got, and update only on the surprise.
Common mistakes
- Baseline that depends on action — reintroduces bias; V(s) must not see which action was taken for the baseline term.
- Not detaching critic for actor loss — actor incorrectly backprops into critic through advantage.
- Critic much faster than actor — advantages near zero; balance learning rates.
- Using G_0 for all timesteps: credit each action with its reward-to-go G_t instead. On long horizons even raw Monte Carlo reward-to-go stays noisy, so consider TD targets or GAE.
- Advantage without normalization — large |A| still causes unstable policy updates.