Module 6 — Actor–critic & PPO

Proximal policy optimization

Clipped surrogate objective, multiple epochs per batch, and PPO hyperparameters.

~70 min read + exercises

Before we begin

Proximal Policy Optimization (PPO) is a widely used default among on-policy algorithms for both discrete and continuous control. It maximizes a clipped surrogate objective on a fixed rollout batch, runs several epochs of minibatch SGD over that batch, and pairs naturally with generalized advantage estimation (GAE). That combination is what you will run on Lunar Lander.


Learning objectives

  • Write the PPO clipped objective with ratio r_t(θ).
  • Implement collect-rollout → GAE → multi-epoch update loop.
  • Tune clip ε, epochs, batch size, and entropy coefficient.
  • Log clip fraction and approximate KL.
  • Use Stable-Baselines3 PPO or a minimal from-scratch trainer.

Clipped surrogate loss

text
r_t(θ) = π(a_t|s_t; θ) / π(a_t|s_t; θ_old)
L^{CLIP} = E [ min( r_t A_t, clip(r_t, 1−ε, 1+ε) A_t ) ]

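ε is typically 0.2. When A_t > 0, the objective stops increasing once r_t exceeds 1+ε; when A_t < 0, it stops increasing once r_t falls below 1−ε. The ratio itself is not hard-constrained, but past those thresholds the gradient is zero, so the update has no incentive to push further. Taking the min makes the objective a pessimistic lower bound on the unclipped surrogate, which stops overly aggressive updates.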
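
To make the two cases concrete, here is a minimal sketch that evaluates the per-sample objective for a few ratios. The function name `clipped_surrogate` and the sample values are illustrative assumptions, not part of the lesson's code.

```python
import torch

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO objective (a quantity to maximize). Hypothetical helper."""
    ratio = torch.as_tensor(ratio, dtype=torch.float32)
    advantage = torch.as_tensor(advantage, dtype=torch.float32)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return torch.min(ratio * advantage, clipped_ratio * advantage)

ratios = [0.5, 1.0, 1.5]
pos = clipped_surrogate(ratios, [1.0, 1.0, 1.0])     # A_t = +1
neg = clipped_surrogate(ratios, [-1.0, -1.0, -1.0])  # A_t = -1

print(pos)  # capped at 1.2 once r_t > 1 + eps
print(neg)  # capped at -0.8 once r_t < 1 - eps; r_t = 1.5 is NOT capped
```

Note the asymmetry: with a negative advantage and r_t = 1.5, the min selects the unclipped term, so a move in the wrong direction is still penalized in full.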

Full PPO loss (practice)

text
L = −L^{CLIP} + c_1 (V − V_target)² − c_2 H(π)

  • Value loss: mean squared error between V(s_t) and the GAE return target (c_1 ~ 0.5).
  • Entropy bonus: encourages exploration (c_2 ~ 0.01).

python
import torch

def ppo_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    # Probability ratio r_t, computed in log space for numerical stability
    ratio = (new_log_prob - old_log_prob).exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Negate: the optimizer minimizes, but L^{CLIP} is maximized
    return -torch.min(unclipped, clipped).mean()
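
The three terms of the full loss can be combined as in the sketch below. It assumes a discrete action space modeled with `torch.distributions.Categorical`; the function name `ppo_total_loss`, the argument names, and the random inputs are illustrative assumptions.

```python
import torch
from torch.distributions import Categorical

def ppo_total_loss(logits, actions, old_log_prob, advantage, values, returns,
                   clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Policy + value + entropy terms for a discrete policy. Hypothetical helper."""
    dist = Categorical(logits=logits)
    ratio = (dist.log_prob(actions) - old_log_prob).exp()
    surrogate = torch.min(
        ratio * advantage,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage,
    )
    policy_loss = -surrogate.mean()                 # -L^{CLIP}
    value_loss = (values - returns).pow(2).mean()   # (V - V_target)^2
    entropy = dist.entropy().mean()                 # H(pi)
    return policy_loss + vf_coef * value_loss - ent_coef * entropy

# Made-up batch of 8 transitions with 4 discrete actions
torch.manual_seed(0)
logits = torch.randn(8, 4, requires_grad=True)
actions = torch.randint(0, 4, (8,))
old_log_prob = Categorical(logits=logits.detach()).log_prob(actions)
advantage = torch.randn(8)
values = torch.randn(8, requires_grad=True)
returns = torch.randn(8)

loss = ppo_total_loss(logits, actions, old_log_prob, advantage, values, returns)
loss.backward()  # gradients flow to both the policy logits and the value estimates
```

On the first pass over a fresh rollout the ratio is exactly 1, so the policy term reduces to the negative mean advantage.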

Training loop outline

python
for iteration in range(num_iterations):
    # 1. Collect T steps per env (vectorization optional)
    rollout = collect_rollout(env, policy, steps=2048)

    # 2. Compute GAE advantages and returns from the rollout-time values
    adv, ret = compute_gae(rollout.rewards, rollout.values,
                           rollout.dones, gamma=0.99, lam=0.95)
    rollout.adv = normalize(adv)
    rollout.ret = ret

    # 3. Freeze the log-probs recorded at collection time
    rollout.old_log_p = rollout.log_probs.detach()

    # 4. K epochs over shuffled minibatches of the same rollout
    for epoch in range(10):
        for batch in minibatches(rollout, batch_size=64):
            new_log_p, entropy, values = policy.evaluate(batch)
            pol_loss = ppo_loss(new_log_p, batch.old_log_p, batch.adv)
            val_loss = (values - batch.ret).pow(2).mean()
            # Subtracting the entropy term maximizes the entropy bonus
            loss = pol_loss + 0.5 * val_loss - 0.01 * entropy.mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
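
The outline leaves `compute_gae` and `normalize` undefined. One possible sketch for a single environment follows. It assumes 1-D tensors of length T, that `dones[t] = 1` marks an episode ending at step t, and it adds a `last_value` argument (the critic's estimate for the state after the final step) that the outline's call does not show.

```python
import torch

def compute_gae(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """GAE advantages and return targets for one env. Sketch, not the lesson's code."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t]  # stop bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        adv[t] = gae
    returns = adv + values  # value-loss targets
    return adv, returns

def normalize(x, eps=1e-8):
    """Zero-mean, unit-std advantages over the batch."""
    return (x - x.mean()) / (x.std() + eps)

# Tiny check: reward 1 per step, zero value estimates, episode ends at the last step
rewards = torch.tensor([1.0, 1.0, 1.0])
values = torch.zeros(3)
dones = torch.tensor([0.0, 0.0, 1.0])
adv, ret = compute_gae(rewards, values, dones, last_value=5.0, gamma=1.0, lam=1.0)
print(adv)  # tensor([3., 2., 1.]): last_value is masked out by the terminal flag
```

With γ = λ = 1 and zero value estimates, the advantage reduces to the undiscounted reward-to-go, which makes the recursion easy to verify by hand.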

Hyperparameters (LunarLander-v2 starting point)

Param               Value     Notes
steps per rollout   2048      Per env; multiply by n_envs
PPO epochs          4–10      More epochs = more overfit risk
minibatch size      64–256
clip ε              0.2       Try 0.1 if unstable
γ                   0.99
GAE λ               0.95
learning rate       3e-4      Linear decay optional
entropy coef        0.01      Reduce if policy too random late

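Stable-Baselines3's PPO defaults match most of these values (its default entropy coefficient is 0.0 rather than 0.01) and work well as a benchmarking baseline.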
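
As a sketch, the table maps onto Stable-Baselines3's `PPO` constructor arguments roughly as follows. The helper names `linear_schedule` and `make_model`, the choice of 8 parallel environments, and the training budget in the comment are assumptions. The environment id is taken from the heading above and may need adjusting for your installed Gymnasium version.

```python
def linear_schedule(initial_lr):
    """Learning-rate schedule; SB3 calls it with progress_remaining going 1.0 -> 0.0."""
    def schedule(progress_remaining):
        return progress_remaining * initial_lr
    return schedule

# Table values expressed as SB3 PPO keyword arguments
PPO_KWARGS = dict(
    n_steps=2048,          # steps per rollout, per env
    n_epochs=10,           # PPO epochs
    batch_size=64,         # minibatch size
    clip_range=0.2,        # clip epsilon
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    learning_rate=linear_schedule(3e-4),
)

def make_model(env_id="LunarLander-v2", n_envs=8):
    # Local imports so the config can be inspected without SB3 installed
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env
    env = make_vec_env(env_id, n_envs=n_envs)
    return PPO("MlpPolicy", env, verbose=1, **PPO_KWARGS)

# Usage (requires stable-baselines3 and the Box2D extras):
# model = make_model()
# model.learn(total_timesteps=500_000)  # budget is a guess, not a tuned value
```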

Worked example — clip in action

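Take ε = 0.2 and A_t = +2 (a good action), and suppose the update has pushed r_t to 1.5.

text
unclipped term = 1.5 × 2 = 3.0
clipped ratio  = clip(1.5, 0.8, 1.2) = 1.2  →  clipped term = 1.2 × 2 = 2.4
objective      = min(3.0, 2.4) = 2.4  (the pessimistic value)

The selected term no longer depends on θ, so its gradient is zero. The action is already 1.5× as likely as under the old policy, and PPO gives no further credit for raising its probability.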
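
The same numbers can be checked with autograd. This sketch treats the ratio as a free variable, which is a simplification: in training it is a function of θ. The helper name `surrogate` is an assumption.

```python
import torch

def surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped objective for a single sample. Hypothetical helper."""
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return torch.min(ratio * advantage, clipped * advantage)

advantage = torch.tensor(2.0)

# Outside the clip range: objective is 2.4 and the gradient vanishes
r_far = torch.tensor(1.5, requires_grad=True)
obj_far = surrogate(r_far, advantage)
obj_far.backward()
print(obj_far.item(), r_far.grad.item())

# Inside the clip range: the gradient equals the advantage
r_near = torch.tensor(1.1, requires_grad=True)
obj_near = surrogate(r_near, advantage)
obj_near.backward()
print(obj_near.item(), r_near.grad.item())
```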

Monitoring

Metric               Healthy
Clip fraction        0.1–0.3 typical
Approx KL            < 0.02 per update
Explained variance   Approaches 1 as the critic fits returns
Episode return       Trending up

python
clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean()
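
The other two numeric metrics in the table can be computed from the same tensors. The sketch below uses one common sample-based KL estimator, mean of (r − 1) − log r, and the usual definition of explained variance. The function name `ppo_diagnostics` and the dictionary keys are assumptions.

```python
import torch

def ppo_diagnostics(new_log_prob, old_log_prob, values, returns, clip_eps=0.2):
    """Logging-only metrics for one minibatch. Hypothetical helper."""
    with torch.no_grad():
        log_ratio = new_log_prob - old_log_prob
        ratio = log_ratio.exp()
        clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean().item()
        # Non-negative estimate of KL(old || new) from samples of the old policy
        approx_kl = ((ratio - 1.0) - log_ratio).mean().item()
        # 1.0 means the critic predicts returns perfectly; <= 0 means it is no
        # better than predicting the mean return
        explained_var = (1.0 - (returns - values).var() / returns.var()).item()
    return {"clip_frac": clip_frac, "approx_kl": approx_kl,
            "explained_var": explained_var}

# Made-up example: one of two samples has moved outside the clip range
old_lp = torch.zeros(2)
new_lp = torch.log(torch.tensor([1.5, 1.0]))
stats = ppo_diagnostics(new_lp, old_lp,
                        values=torch.tensor([1.0, 2.0]),
                        returns=torch.tensor([1.0, 2.0]))
print(stats)
```

Explained variance is undefined when all returns in the batch are equal, so guard against a zero denominator in real logging code.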

Checkpoint

PPO reuses the same rollout for several epochs, which is why the clip matters: without it, repeated passes over the same data overfit the surrogate. In short: collect once, learn many times, but clip the policy ratios so π does not run away.

Common mistakes

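  1. Recomputing old_log_prob after a policy update. It must stay frozen at its collection-time value; otherwise the ratio is always 1 and the clip never binds.
  2. Too many epochs. Clip fraction climbs toward 1.0 and approximate KL explodes.
  3. No advantage normalization. The advantage scale then follows the reward scale, so the effective policy step size and its balance against the value and entropy terms vary from batch to batch.
  4. Single env, tiny rollouts. Gradient estimates have high variance; use VectorEnv or parallel envs.
  5. Wrong sign on entropy. Subtract the entropy term from the loss so that minimizing the loss maximizes the entropy bonus.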
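
Mistake 1 is easy to demonstrate on a toy problem. The sketch below uses a made-up linear policy and random data (all names and sizes are assumptions). It takes one gradient step, then compares the ratio computed against frozen log-probs with the ratio computed against freshly recomputed ones.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
policy = torch.nn.Linear(4, 2)  # toy policy: 4-dim state, 2 actions
optimizer = torch.optim.SGD(policy.parameters(), lr=0.5)
states = torch.randn(16, 4)
actions = torch.randint(0, 2, (16,))
advantage = torch.randn(16)

def log_prob(states, actions):
    return Categorical(logits=policy(states)).log_prob(actions)

# Correct: freeze the log-probs before any update
frozen_old = log_prob(states, actions).detach()

# One gradient step on the (unclipped) surrogate
loss = -((log_prob(states, actions) - frozen_old).exp() * advantage).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After the update, the frozen reference shows how far the policy has moved
ratio_frozen = (log_prob(states, actions) - frozen_old).exp().detach()

# Bug: recomputing "old" log-probs from the updated policy hides the movement
recomputed_old = log_prob(states, actions).detach()
ratio_recomputed = (log_prob(states, actions) - recomputed_old).exp().detach()

print(ratio_frozen.min().item(), ratio_frozen.max().item())
print(ratio_recomputed.unique())  # every entry is exactly 1
```

With the recomputed reference, the clip range is never reached, so the objective degenerates to an unconstrained policy-gradient step on stale advantages.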
