Module 5 — Policy gradients

Baseline & variance reduction

State-dependent baselines, advantage intuition, and reward-to-go.

~60 min read + exercises

Before we begin

REINFORCE's gradient is correct in expectation but noisy. Subtracting a baseline b(s) that does not depend on the action cuts variance without introducing bias. The most common baseline is a learned value function V(s; w) — the bridge to actor–critic methods.


Learning objectives

  • Show that subtracting action-independent baselines leaves the gradient unbiased.
  • Use reward-to-go minus V(s_t) as the advantage signal.
  • Implement a constant baseline (batch mean return) as a quick win.
  • Interpret advantage A_t > 0 as "action better than expected."
  • Prepare for critic networks in the next lesson.

Baseline math (intuition)

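Replace G_t with (G_t − b(s_t)) in the policy gradient. The estimate stays unbiased because the extra baseline term has zero expectation: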

text
E [ ∇ log π(a|s) · b(s) ] = 0   when b(s) does not depend on a

Proof sketch: since π(a|s) ∇ log π(a|s) = ∇ π(a|s), we have sum_a π(a|s) ∇ log π(a|s) · b(s) = b(s) ∇ sum_a π(a|s) = b(s) ∇ 1 = 0. So baselines remove noise, not signal.
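The identity can be checked exactly for a small softmax policy. This is a minimal sketch in plain Python; the logits, the baseline value 37.0, and the helper names are hypothetical choices made for illustration, not part of the lesson's code.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(probs, action):
    # d/d logit_k of log pi(action) = 1[k == action] - pi(k)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def expected_baseline_term(logits, baseline):
    # Computes sum_a pi(a) * grad log pi(a) * b, one entry per logit
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        g = grad_log_softmax(probs, a)
        for k in range(len(logits)):
            total[k] += p_a * g[k] * baseline
    return total

term = expected_baseline_term([0.5, -1.2, 2.0], baseline=37.0)
print(term)  # every entry is zero up to floating-point rounding
```

Any baseline value gives the same result, because b factors out of the sum over actions.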

Constant baseline — quick experiment

python
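def reinforce_with_baseline(log_probs, returns):
    # log_probs: per-step log pi(a_t|s_t); returns: tensor of matching returns
    baseline = returns.mean()  # constant baseline: batch mean return
    advantages = returns - baseline
    # Minimizing this loss ascends sum_t log pi(a_t|s_t) * advantage_t
    loss = sum(-lp * adv for lp, adv in zip(log_probs, advantages))
    return loss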

Subtracting the batch mean return is a one-line change that often speeds up CartPole learning, sometimes by 2–3×.
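To see why a constant baseline helps, consider a hypothetical two-action bandit (not from the lesson) with rewards 100 and 102 and a uniform softmax policy. The sketch below computes the exact mean and variance of the single-sample gradient estimate for the first logit, with and without a baseline equal to the mean reward.

```python
def grad_stats(probs, rewards, baseline):
    """Exact mean and variance of the single-sample gradient estimate
    (1[a == 0] - pi(0)) * (R(a) - b) with respect to logit 0."""
    samples = [((1.0 if a == 0 else 0.0) - probs[0]) * (r - baseline)
               for a, r in enumerate(rewards)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

probs = [0.5, 0.5]        # assumed uniform policy over two actions
rewards = [100.0, 102.0]  # assumed deterministic reward per action

mean_raw, var_raw = grad_stats(probs, rewards, baseline=0.0)
mean_b, var_b = grad_stats(probs, rewards, baseline=101.0)

print(mean_raw, var_raw)  # -0.5 2550.25
print(mean_b, var_b)      # -0.5 0.0
```

The expected gradient is identical in both cases; only the spread around it changes. In this toy setting the variance vanishes entirely, which is an artifact of having two equally likely actions with deterministic rewards.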

State-dependent baseline V(s)

Learn V(s; w) with regression to returns or TD targets. Use advantage:

text
A_t = G_t − V(s_t)

Policy update: follow ∇ log π(a_t|s_t) · A_t. Critic update: minimize (V(s_t) − G_t)² or the squared TD error.
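Here G_t is the reward-to-go from step t, meaning the discounted sum of rewards from t onward. A minimal sketch of computing it and the resulting advantages follows; the function names, the discount default, and the example value estimates are assumptions made for illustration.

```python
def reward_to_go(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one episode
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma=0.99):
    # A_t = G_t - V(s_t), with values[t] standing in for V(s_t)
    return [g - v for g, v in zip(reward_to_go(rewards, gamma), values)]

rtg = reward_to_go([1.0, 1.0, 1.0], gamma=1.0)
adv = advantages([1.0, 1.0, 1.0], [2.5, 2.5, 0.5], gamma=1.0)
print(rtg)  # [3.0, 2.0, 1.0]
print(adv)  # [0.5, -0.5, 0.5]
```

A positive entry means the episode went better from that state than the critic predicted, and a negative entry means worse.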

Worked example — numeric baseline

Three episodes end with returns 80, 100, 60. The constant baseline is their mean, b = (80 + 100 + 60) / 3 = 80.

Episode | G_0 | G_0 − b | Effect on gradient
1       | 80  | 0       | near-zero update
2       | 100 | +20     | strengthen trajectory
3       | 60  | −20     | weaken trajectory

What matters is each return relative to the baseline, not the absolute return scale, which is why a baseline pairs well with return normalization.
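The arithmetic in the table can be reproduced in a few lines. The sketch below also shifts every return by an arbitrary offset of 1000 (an assumption chosen for illustration) to show that the centered values do not change.

```python
from statistics import mean

returns = [80.0, 100.0, 60.0]
baseline = mean(returns)
centered = [g - baseline for g in returns]
print(baseline, centered)  # 80.0 [0.0, 20.0, -20.0]

# Shifting every return by the same amount leaves the centered values unchanged
shifted = [g + 1000.0 for g in returns]
shifted_centered = [g - mean(shifted) for g in shifted]
print(shifted_centered)  # [0.0, 20.0, -20.0]
```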

PyTorch: actor step with learned baseline

python
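import torch.nn as nn

class ValueNet(nn.Module):
    """Critic: maps an observation to a scalar state-value estimate V(s; w)."""

    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        # (batch, 1) -> (batch,) so values line up with returns
        return self.net(obs).squeeze(-1)

def actor_loss(log_probs, values, returns):
    # detach() keeps actor gradients out of the critic
    advantages = returns - values.detach()
    return -(log_probs * advantages).sum()

def critic_loss(values, returns):
    # Regress V(s_t) onto the Monte Carlo return G_t
    return ((values - returns) ** 2).mean()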

Call .detach() on values in the actor loss: the critic learns separately through its own loss, and the actor treats V as a fixed baseline during the policy step.
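The effect of .detach() can be observed directly by inspecting gradients after a backward pass. In this sketch a single nn.Linear layer stands in for ValueNet and random tensors stand in for real rollouts; both are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
critic = nn.Linear(4, 1)                         # stand-in for ValueNet
obs = torch.randn(5, 4)                          # fake batch of observations
log_probs = torch.randn(5, requires_grad=True)   # stand-in for actor log-probs
returns = torch.randn(5)

# With detach: the actor loss sends no gradient into the critic
values = critic(obs).squeeze(-1)
loss = -(log_probs * (returns - values.detach())).sum()
loss.backward()
critic_untouched = all(p.grad is None for p in critic.parameters())
actor_has_grad = log_probs.grad is not None

# Without detach: the same loss now backprops into the critic's parameters
values = critic(obs).squeeze(-1)
leaky_loss = -(log_probs * (returns - values)).sum()
leaky_loss.backward()
critic_leaked = all(p.grad is not None for p in critic.parameters())

print(critic_untouched, actor_has_grad, critic_leaked)  # True True True
```

In the leaky version, an optimizer step on the critic would push V in whatever direction lowers the actor loss, which is not the regression objective the critic is meant to follow.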

Other variance reduction tools (preview)

Technique            | What it does
Baseline V(s)        | Center advantages per state
Reward normalization | Scale returns batch-wise
Entropy bonus        | Encourage exploration, prevent collapse
GAE (Module 6)       | Bias–variance tradeoff for multi-step advantage

Checkpoint: If REINFORCE learns but the training curves look like a seismograph, add a baseline before touching network depth.

Summary: Subtract what you expected from what you got, and update only on the surprise.

Common mistakes

  1. Baseline that depends on action — reintroduces bias; V(s) must not see which action was taken for the baseline term.
  2. Not detaching critic for actor loss — actor incorrectly backprops into critic through advantage.
  3. Critic much faster than actor — advantages near zero; balance learning rates.
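  4. Using G_0 for all timesteps — use reward-to-go G_t instead, since a_t cannot affect rewards received before step t; on long horizons even G_t is noisy, so prefer TD targets or GAE over raw Monte Carlo returns.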
  5. Advantage without normalization — large |A| still causes unstable policy updates.
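For the last item, one common remedy is to standardize advantages within each batch before computing the policy loss. The sketch below is one way to do it in plain Python; the function name and the eps default are assumptions, and in a PyTorch training loop the same arithmetic would be applied to a tensor.

```python
from statistics import mean, pstdev

def normalize_advantages(advantages, eps=1e-8):
    # Shift to zero mean and scale to unit standard deviation within the batch;
    # eps guards against division by zero when all advantages are equal
    mu = mean(advantages)
    sigma = pstdev(advantages)
    return [(a - mu) / (sigma + eps) for a in advantages]

normalized = normalize_advantages([0.0, 200.0, -200.0])
print(normalized)
```

Signs and ordering are preserved, so the update still favors the better-than-expected actions; only the step scale becomes independent of the raw reward magnitude.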
