Baseline & variance reduction

Before we begin

REINFORCE's gradient is correct in expectation but noisy. Subtracting a baseline b(s) that does not depend on the action cuts variance without introducing bias. The most common baseline is a learned value function V(s; w) — the bridge to actor–critic methods.

Learning objectives

Show that subtracting action-independent baselines leaves the gradient unbiased.
Use reward-to-go minus V(s_t) as the advantage signal.
Implement a constant baseline (batch mean return) as a quick win.
Interpret advantage A_t > 0 as "action better than expected."
Prepare for critic networks in the next lesson.

Baseline math (intuition)

Replace G_t with (G_t − b(s_t)). The extra term this introduces vanishes in expectation:

text

E [ ∇ log π(a|s) · b(s) ] = 0   when b(s) does not depend on a

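Proof sketch: b(s) is constant with respect to a, so it factors out of the expectation over actions, leaving sum_a π(a|s) ∇ log π(a|s) = sum_a ∇ π(a|s) = ∇ sum_a π(a|s) = ∇ 1 = 0. So baselines remove noise, not signal.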
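
The identity can be checked numerically. The sketch below is an illustration only: it assumes a softmax policy over three actions with made-up logits, and sums over all actions exactly rather than sampling. The helper names (softmax, grad_log_prob, expected_baseline_term) are hypothetical, not part of the lesson's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax policy."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def expected_baseline_term(logits, baseline_fn):
    """Exact E_a[ grad log pi(a|s) * b ], summing over every action."""
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        g = grad_log_prob(probs, a)
        b = baseline_fn(a)
        for k in range(len(logits)):
            total[k] += p_a * g[k] * b
    return total

logits = [0.5, -1.0, 2.0]  # arbitrary example logits
# Baseline that ignores the action: the expected term is zero.
action_independent = expected_baseline_term(logits, lambda a: 37.0)
# "Baseline" that depends on the action: the expected term is not zero.
action_dependent = expected_baseline_term(logits, lambda a: float(a))

print(action_independent)
print(action_dependent)
```

The first result is zero up to floating-point error for any constant, while the second is not, which is the bias that an action-dependent baseline introduces.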

Constant baseline — quick experiment

python

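def reinforce_with_baseline(log_probs, returns):
    # returns: 1-D tensor of returns; log_probs: matching log π(a|s) values
    baseline = returns.mean()          # constant baseline: batch mean return
    advantages = returns - baseline    # centered returns
    # Negative sign: minimizing this loss performs gradient ascent on return
    loss = sum(-lp * adv for lp, adv in zip(log_probs, advantages))
    return loss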

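Subtracting the batch mean return is a one-line change that often speeds up learning on CartPole. The 2–3× figure is a rough rule of thumb and varies with seed and hyperparameters.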
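
To see why centering helps, here is a small sketch on a toy problem rather than CartPole. It assumes a hypothetical two-armed bandit with fixed rewards of 100 and 90 and a uniform softmax policy, and computes the exact mean and variance of the single-sample gradient estimate for the first logit. It does not reproduce the CartPole speed-up; it only illustrates the mechanism.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_stats(logits, rewards, baseline):
    """Exact mean and variance of d/d(logit_0) log pi(a) * (R(a) - baseline), a ~ pi."""
    probs = softmax(logits)
    samples = []
    for a, p_a in enumerate(probs):
        score = (1.0 if a == 0 else 0.0) - probs[0]  # d log pi(a) / d logit_0
        samples.append((p_a, score * (rewards[a] - baseline)))
    mean = sum(p * x for p, x in samples)
    var = sum(p * (x - mean) ** 2 for p, x in samples)
    return mean, var

logits = [0.0, 0.0]        # uniform policy (assumption for the example)
rewards = [100.0, 90.0]    # hypothetical fixed reward per arm
expected_return = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_raw, var_raw = gradient_stats(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(logits, rewards, baseline=expected_return)

print(mean_raw, var_raw)    # same mean, large variance
print(mean_base, var_base)  # same mean, much smaller variance
```

Both estimators share the same expected gradient, but the centered one has far lower variance. In this symmetric toy case the variance drops all the way to zero, which will not happen in general; a batch mean is only a sample estimate of the expected return.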

State-dependent baseline V(s)

Learn V(s; w) by regression onto observed returns or TD targets, then use the advantage as the policy's learning signal:

text

A_t = G_t − V(s_t)

Policy update: ∇ log π(a_t|s_t) · A_t. Critic update: minimize (V(s_t) − G_t)² or TD error.
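
As a minimal sketch of the advantage computation, the functions below build the reward-to-go G_t for one episode and subtract per-step value estimates. The function names, the gamma parameter, and the example critic predictions are assumptions made for illustration.

```python
def reward_to_go(rewards, gamma=0.99):
    """Discounted return from each timestep onward: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages_from_values(rewards, values, gamma=0.99):
    """A_t = G_t - V(s_t), with V(s_t) supplied as plain numbers."""
    returns = reward_to_go(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]

rewards = [1.0, 1.0, 1.0]   # one short episode
values = [2.5, 2.0, 0.5]    # hypothetical critic predictions V(s_t)

returns = reward_to_go(rewards, gamma=1.0)
advantages = advantages_from_values(rewards, values, gamma=1.0)

print(returns)      # [3.0, 2.0, 1.0]
print(advantages)   # [0.5, 0.0, 0.5]
```

A positive entry means the episode did better from that step on than the critic predicted, and a zero entry means the critic's estimate matched the outcome.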

Worked example — numeric baseline

Three episodes end with returns 80, 100, 60. The constant baseline is their mean, b = 80.

Episode	G_0	G_0 − b	Effect on gradient
1	80	0	near-zero update
2	100	+20	strengthen trajectory
3	60	−20	weaken trajectory

What drives the update is how each return compares with the baseline, not the absolute scale of the returns. This is why baselines pair well with return normalization.
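
The arithmetic in the table can be reproduced in a few lines. This sketch also divides by the batch standard deviation, one common normalization choice; the function name and the small epsilon are assumptions, not something the lesson prescribes.

```python
import statistics

def centered_and_normalized(returns, eps=1e-8):
    """Subtract the batch mean, then divide by the batch standard deviation."""
    baseline = statistics.fmean(returns)
    advantages = [g - baseline for g in returns]
    std = statistics.pstdev(returns)
    normalized = [a / (std + eps) for a in advantages]
    return baseline, advantages, normalized

returns = [80.0, 100.0, 60.0]
baseline, advantages, normalized = centered_and_normalized(returns)
print(baseline)     # 80.0
print(advantages)   # [0.0, 20.0, -20.0]
print(normalized)   # roughly [0.0, 1.22, -1.22]

# Adding the same offset to every return leaves the advantages unchanged.
shifted = [g + 1000.0 for g in returns]
_, shifted_advantages, _ = centered_and_normalized(shifted)
print(shifted_advantages)   # [0.0, 20.0, -20.0]
```

The shifted batch produces the same advantages, which shows that the update depends on each return relative to the batch, not on its absolute size.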

PyTorch: actor step with learned baseline

python

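import torch.nn as nn

class ValueNet(nn.Module):
    """Critic: maps an observation to a scalar estimate V(s; w)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        # Drop the trailing size-1 dimension: (batch, 1) -> (batch,)
        return self.net(obs).squeeze(-1)

def actor_loss(log_probs, values, returns):
    # detach() stops actor gradients from flowing into the critic
    advantages = returns - values.detach()
    return -(log_probs * advantages).sum()

def critic_loss(values, returns):
    # Mean squared error between V(s_t) and the Monte Carlo return G_t
    return ((values - returns) ** 2).mean()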

Call .detach() on values in the actor loss: the critic is trained separately by its own loss, and the actor treats V as a fixed baseline during the policy step.

Other variance reduction tools (preview)

Technique	What it does
Baseline V(s)	Center advantages per state
Reward normalization	Scale returns batch-wise
Entropy bonus	Encourage exploration, prevent collapse
GAE (Module 6)	Bias–variance tradeoff for multi-step advantage
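
As a preview of the entropy bonus row, the sketch below computes the entropy of a discrete action distribution and subtracts a scaled copy from a policy loss. The coefficient name beta and its default value are assumptions chosen for illustration, and the loss is a plain number rather than a tensor.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(policy_loss, probs, beta=0.01):
    """Subtracting beta * entropy rewards policies that keep exploring."""
    return policy_loss - beta * entropy(probs)

uniform = entropy([0.5, 0.5])    # maximal entropy for two actions
peaked = entropy([0.99, 0.01])   # nearly collapsed policy

print(uniform, peaked)
print(loss_with_entropy_bonus(1.0, [0.5, 0.5], beta=0.1))
```

A near-deterministic policy has low entropy and so earns a smaller bonus, which pushes against premature collapse onto a single action.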

Checkpoint: If REINFORCE learns but the training curves look like a seismograph, add a baseline before touching network depth.

Summary: Subtract what you expected from what you got, and update only on the surprise.

Common mistakes

Baseline that depends on action — reintroduces bias; V(s) must not see which action was taken for the baseline term.
Not detaching critic for actor loss — actor incorrectly backprops into critic through advantage.
Critic much faster than actor — advantages near zero; balance learning rates.
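Using G_0 for all timesteps — credit each action with its reward-to-go G_t instead of the full-episode return; on long horizons, raw Monte Carlo returns stay noisy, so bootstrap with TD targets or GAE.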
Advantage without normalization — large |A| still causes unstable policy updates.

What's next

Next lesson — Actor–critic architecture