Proximal policy optimization
Before we begin
Proximal Policy Optimization (PPO) is a widely used on-policy algorithm for both discrete and continuous control. It maximizes a clipped surrogate objective on fixed rollout batches, runs multiple epochs of minibatch SGD, and pairs naturally with generalized advantage estimation (GAE). That combination is what you will run on Lunar Lander.
Learning objectives
- Write the PPO clipped objective with ratio r_t(θ).
- Implement collect-rollout → GAE → multi-epoch update loop.
- Tune clip ε, epochs, batch size, and entropy coefficient.
- Log clip fraction and approximate KL.
- Use Stable-Baselines3 PPO or a minimal from-scratch trainer.
Clipped surrogate loss
r_t(θ) = π(a_t|s_t; θ) / π(a_t|s_t; θ_old)
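L^{CLIP}(θ) = E_t [ min( r_t(θ) A_t, clip(r_t(θ), 1−ε, 1+ε) A_t ) ]
ε is typically 0.2. If A_t > 0, the objective stops increasing once r_t exceeds 1+ε. If A_t < 0, it stops increasing once r_t falls below 1−ε. In both cases the gradient vanishes past the clip boundary, so nothing pushes the policy further in that direction. Taking the min makes the objective a pessimistic bound on the unclipped surrogate, which discourages overly aggressive updates.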
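The two advantage cases can be made explicit by writing the per-sample term piecewise. This is an equivalent rewriting of the min/clip expression above, not a different objective:

```latex
L^{CLIP}_t(\theta) =
\begin{cases}
A_t \,\min\bigl(r_t(\theta),\; 1+\varepsilon\bigr), & A_t \ge 0,\\[4pt]
A_t \,\max\bigl(r_t(\theta),\; 1-\varepsilon\bigr), & A_t < 0.
\end{cases}
```

Only one side of the clip is ever active for a given sample: the upper bound when the action was better than expected, the lower bound when it was worse. A ratio that moves in the "wrong" direction (for example r_t > 1+ε with A_t < 0) is not clipped and is penalized in full.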
Full PPO loss (practice)
L = −L^{CLIP} + c_1 (V − V_target)² − c_2 H(π)
- Value loss: mean squared error between the critic's V and the GAE return target (c_1 ~ 0.5).
- Entropy bonus: encourages exploration (c_2 ~ 0.01).
import torch

def ppo_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    # Probability ratio r_t, computed in log space for numerical stability
    ratio = (new_log_prob - old_log_prob).exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Negate: the optimizer minimizes, but L^{CLIP} is maximized
    return -torch.min(unclipped, clipped).mean()
Training loop outline
# Outline only: collect_rollout, compute_gae, normalize, and minibatches are placeholder helpers.
for iteration in range(num_iterations):
    # 1. Collect T steps per env (vectorized optional)
    rollout = collect_rollout(env, policy, steps=2048)
    # 2. Compute values, GAE advantages, returns
    adv, ret = compute_gae(rollout.rewards, rollout.values,
                           rollout.dones, gamma=0.99, lam=0.95)
    adv = normalize(adv)
    # 3. Store old log_probs (frozen at collection time, no gradient)
    old_log_probs = rollout.log_probs.detach()
    # 4. K epochs over shuffled minibatches
    for epoch in range(10):
        # Each batch carries matching slices of old_log_probs, adv, and ret
        for batch in minibatches(rollout, old_log_probs, adv, ret, batch_size=64):
            new_log_p, entropy, values = policy.evaluate(batch)
            pol_loss = ppo_loss(new_log_p, batch.old_log_p, batch.adv)
            val_loss = (values - batch.ret).pow(2).mean()
            loss = pol_loss + 0.5 * val_loss - 0.01 * entropy.mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
Hyperparameters (LunarLander-v2 starting point)
| Param | Value | Notes |
|---|---|---|
| steps per rollout | 2048 | Per env; multiply by n_envs |
| PPO epochs | 4–10 | More epochs = more overfit risk |
| minibatch size | 64–256 | |
| clip ε | 0.2 | Try 0.1 if unstable |
| γ | 0.99 | |
| GAE λ | 0.95 | |
| learning rate | 3e-4 | Linear decay optional |
| entropy coef | 0.01 | Reduce if the policy is still too random late in training |
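The γ and GAE λ rows feed the advantage computation that the training loop refers to as `compute_gae`. The lesson does not define that helper, so the following is a minimal NumPy sketch of one common formulation for a single environment. The `last_value` argument (the bootstrap value for a rollout cut off mid-episode) is an assumption added here and is not part of the call shown in the outline.

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length T.

    rewards, values, dones: 1-D sequences of length T.
    dones[t] is True if the episode ended after step t.
    last_value: V(s_T), used to bootstrap when the rollout stops mid-episode
                (hypothetical extra argument, defaults to 0).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)

    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        # One-step TD error; the mask drops the bootstrap term at episode ends.
        delta = rewards[t] + gamma * next_value * not_done[t] - values[t]
        gae = delta + gamma * lam * not_done[t] * gae
        adv[t] = gae

    returns = adv + values  # regression targets for the value loss
    return adv, returns
```

With λ = 0 the advantages reduce to one-step TD errors (low variance, more bias); with λ = 1 they become Monte Carlo returns minus the value baseline (less bias, more variance). The 0.95 in the table sits between the two.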
Stable-Baselines3 defaults work well for benchmarking.
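For reference, the table maps onto Stable-Baselines3's `PPO` constructor arguments roughly as follows. This is a sketch, assuming `stable_baselines3` and a Box2D-enabled Gymnasium install are available; the `make_model` helper and the `PPO_KWARGS` name are illustrative, not part of the library.

```python
# Table values expressed with Stable-Baselines3 PPO argument names.
PPO_KWARGS = dict(
    n_steps=2048,        # steps per rollout, per env
    n_epochs=10,         # PPO epochs per rollout
    batch_size=64,       # minibatch size
    clip_range=0.2,      # clip epsilon
    gamma=0.99,
    gae_lambda=0.95,
    learning_rate=3e-4,
    ent_coef=0.01,       # table value; the library's own default is 0.0
)

def make_model(env_id="LunarLander-v2"):
    """Build a PPO model (hypothetical helper).

    The environment ID is taken from this lesson; newer Gymnasium releases
    may register Lunar Lander under a different version suffix.
    """
    # Imported lazily so the config above can be inspected without the library.
    from stable_baselines3 import PPO
    return PPO("MlpPolicy", env_id, verbose=1, **PPO_KWARGS)

# Usage (not executed here):
# model = make_model()
# model.learn(total_timesteps=1_000_000)
```

Keeping the hyperparameters in one dictionary makes it easy to log them alongside results and to change one value at a time when tuning.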
Worked example — clip in action
ε=0.2, A_t = +2 (good action), r_t would be 1.5 without clip.
unclipped = 1.5 × 2 = 3.0
clipped r = min(1.5, 1.2) = 1.2 → 1.2 × 2 = 2.4
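The objective takes min(3.0, 2.4) = 2.4, the pessimistic branch. That branch does not depend on θ, so the gradient for this sample is zero: the update stops pushing up an action that is already 1.5× more likely under the new policy.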
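The same arithmetic can be checked in a few lines of plain Python. The sketch below evaluates the per-sample clipped term for the worked example and for three further ratio/advantage combinations; `clipped_term` is an illustrative helper name, not something defined elsewhere in this lesson.

```python
def clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

cases = [
    (1.5, 2.0),   # worked example: good action, ratio too high, clipped
    (0.5, -2.0),  # bad action, ratio already pushed low, clipped
    (1.5, -2.0),  # bad action made MORE likely, not clipped, full penalty
    (0.5, 2.0),   # good action made LESS likely, not clipped
]
for r, a in cases:
    print(f"r={r}, A={a:+.1f} -> {clipped_term(r, a):.2f}")
```

The first two cases are where the clip is active. The last two show that the clip never hides a move in the wrong direction, which is what makes the bound pessimistic.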
Monitoring
| Metric | Healthy |
|---|---|
| Clip fraction | 0.1–0.3 typical |
| Approx KL | < 0.02 per update |
| Explained variance | Rising toward 1 (critic fits returns) |
| Episode return | Trending up |
clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean()
Checkpoint: PPO reuses the same rollout for several epochs, which is why the clip matters. Without it, multiple passes over the same data overfit the surrogate.
Summary: Collect once, learn many times, but clip policy ratios so π does not run away.
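The first three metrics in the monitoring table can be computed from quantities already present in a minibatch. The sketch below uses NumPy and assumes 1-D arrays of log-probabilities, value predictions, and return targets; the function name and the choice of KL estimator are assumptions for illustration, since the lesson does not specify either.

```python
import numpy as np

def ppo_diagnostics(new_log_prob, old_log_prob, values, returns, clip_eps=0.2):
    """Clip fraction, approximate KL, and explained variance for one batch."""
    new_log_prob = np.asarray(new_log_prob, dtype=np.float64)
    old_log_prob = np.asarray(old_log_prob, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    returns = np.asarray(returns, dtype=np.float64)

    log_ratio = new_log_prob - old_log_prob
    ratio = np.exp(log_ratio)

    # Share of samples whose ratio left the [1 - eps, 1 + eps] band.
    clip_frac = float(np.mean(np.abs(ratio - 1.0) > clip_eps))

    # Sample estimate of KL(old || new): mean((r - 1) - log r), never negative.
    approx_kl = float(np.mean((ratio - 1.0) - log_ratio))

    # 1 - Var(returns - values) / Var(returns); undefined for constant returns.
    var_ret = np.var(returns)
    if var_ret == 0:
        explained_var = float("nan")
    else:
        explained_var = float(1.0 - np.var(returns - values) / var_ret)

    return {
        "clip_frac": clip_frac,
        "approx_kl": approx_kl,
        "explained_variance": explained_var,
    }
```

Logging these once per update, averaged over minibatches, is usually enough to spot the failure modes listed under common mistakes. A simpler KL estimate, the mean of `old_log_prob - new_log_prob`, is also used in practice but can come out negative on small batches.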
Common mistakes
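- Recomputing old_log_prob after a policy update: it must stay frozen at its collection-time value, otherwise the ratio is always 1 and the clip never activates.
- Too many epochs: clip fraction climbs toward 1.0 and KL explodes.
- No advantage normalization: if most advantages in a batch share one sign, the clip binds mostly on one side, and the gradient scale drifts from batch to batch.
- Single env, tiny rollouts: high variance; use VectorEnv or parallel envs.
- Wrong sign on entropy: the entropy term is subtracted from the loss, so that minimizing the loss maximizes entropy.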
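For the advantage-normalization point, the `normalize` call in the training loop outline is not defined in this lesson. A minimal sketch of the usual per-batch standardization, assuming NumPy arrays, could look like this:

```python
import numpy as np

def normalize(adv, eps=1e-8):
    """Standardize advantages to zero mean and unit variance within a batch.

    eps guards against division by zero when all advantages are equal.
    """
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```

After this step roughly half of the samples have positive advantage and half negative, so both sides of the clip see comparable use. Whether to normalize over the whole rollout or per minibatch is a design choice; this sketch normalizes whatever array it is given.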