Proximal policy optimization

Before we begin

Proximal Policy Optimization (PPO) is a widely used default on-policy algorithm for discrete and continuous control. It maximizes a clipped surrogate objective on fixed rollout batches, runs multiple epochs of minibatch SGD per batch, and pairs naturally with GAE (generalized advantage estimation), the combination you will run on Lunar Lander.

Learning objectives

Write the PPO clipped objective with ratio r_t(θ).
Implement collect-rollout → GAE → multi-epoch update loop.
Tune clip ε, epochs, batch size, and entropy coefficient.
Log clip fraction and approximate KL.
Use Stable-Baselines3 PPO or a minimal from-scratch trainer.

Clipped surrogate loss

text

r_t(θ) = π(a_t|s_t; θ) / π(a_t|s_t; θ_old)
L^{CLIP} = E [ min( r_t A_t, clip(r_t, 1−ε, 1+ε) A_t ) ]

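ε is typically 0.2. The ratio itself is not constrained; what the clip removes is the incentive to push it further. If A_t > 0, the objective stops increasing once r_t exceeds 1+ε. If A_t < 0, it stops increasing once r_t falls below 1−ε. Taking the min with the unclipped term makes the objective a pessimistic bound, which discourages overly aggressive updates.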
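The effect of the clip is easiest to see in the gradient. The sketch below assumes PyTorch; the function name `clipped_surrogate` and the five sample values are illustrative, not part of the lesson. It evaluates the per-sample objective and differentiates it with respect to the ratio.

```python
import torch

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample objective min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return torch.min(unclipped, clipped)

# Illustrative samples, one per column:
#   1. A > 0, r above 1+eps   -> clipped, zero gradient
#   2. A > 0, r inside range  -> ordinary gradient A
#   3. A < 0, r below 1-eps   -> clipped, zero gradient
#   4. A < 0, r inside range  -> ordinary gradient A
#   5. A > 0, r below 1-eps   -> NOT clipped: moving r back up is still rewarded
ratio = torch.tensor([1.5, 1.1, 0.5, 0.9, 0.5], requires_grad=True)
advantage = torch.tensor([2.0, 2.0, -2.0, -2.0, 2.0])

objective = clipped_surrogate(ratio, advantage)
objective.sum().backward()

print(objective.detach())  # per-sample objective values
print(ratio.grad)          # d(objective)/d(ratio) for each sample
```

Samples 1 and 3 get zero gradient, so the optimizer has no reason to move those ratios further from 1. Sample 5 shows that the clip is one-sided: a ratio that has drifted in the unhelpful direction is still pulled back.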

Full PPO loss (practice)

text

L = −L^{CLIP} + c_1 (V − V_target)² − c_2 H(π)

Value loss: MSE to GAE returns.
Entropy bonus: encourages exploration (c_2 ~ 0.01).

python

import torch
 
def ppo_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    ratio = (new_log_prob - old_log_prob).exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
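The function above covers only the policy term. As a rough sketch of the full loss L = −L^{CLIP} + c_1 (V − V_target)² − c_2 H(π), the three terms can be combined as below. The name `ppo_total_loss` and the arguments `vf_coef` and `ent_coef` (standing in for c_1 and c_2) are assumptions for illustration, as are the toy minibatch values.

```python
import torch

def ppo_total_loss(new_log_prob, old_log_prob, advantage, values, returns,
                   entropy, clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Return (total, policy_loss, value_loss, entropy_mean) for one minibatch."""
    ratio = (new_log_prob - old_log_prob).exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()  # -L^CLIP
    value_loss = (values - returns).pow(2).mean()        # (V - V_target)^2
    entropy_mean = entropy.mean()                        # H(pi)
    total = policy_loss + vf_coef * value_loss - ent_coef * entropy_mean
    return total, policy_loss, value_loss, entropy_mean

# Toy minibatch of two samples (made-up numbers).
new_lp = torch.log(torch.tensor([0.55, 0.30]))
old_lp = torch.log(torch.tensor([0.50, 0.50]))
adv = torch.tensor([1.0, -1.0])
values = torch.tensor([0.5, 0.0])
returns = torch.tensor([1.0, 0.0])
entropy = torch.tensor([0.7, 0.7])

total, pol, val, ent = ppo_total_loss(new_lp, old_lp, adv, values, returns, entropy)
print(total.item(), pol.item(), val.item(), ent.item())
```

Returning the individual terms alongside the total makes it easy to log them separately, which helps when one term dominates the others.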

Training loop outline

python

# Outline only: collect_rollout, compute_gae, normalize, minibatches and
# policy.evaluate are placeholder helpers you supply.
for iteration in range(num_iterations):
    # 1. Collect T steps per env with the current policy (vectorized optional)
    rollout = collect_rollout(env, policy, steps=2048)

    # 2. Compute GAE advantages and returns from the stored value estimates
    adv, ret = compute_gae(rollout.rewards, rollout.values,
                           rollout.dones, gamma=0.99, lam=0.95)
    rollout.adv = normalize(adv)  # zero mean, unit std
    rollout.ret = ret

    # 3. Freeze the log-probs recorded at collection time
    rollout.old_log_p = rollout.log_probs.detach()

    # 4. K epochs over shuffled minibatches of the same rollout
    for epoch in range(10):
        for batch in minibatches(rollout, batch_size=64):
            new_log_p, entropy, values = policy.evaluate(batch)
            pol_loss = ppo_loss(new_log_p, batch.old_log_p, batch.adv)
            val_loss = (values - batch.ret).pow(2).mean()
            # Entropy is subtracted: minimizing the loss maximizes entropy
            loss = pol_loss + 0.5 * val_loss - 0.01 * entropy.mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
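The outline above calls `compute_gae` and `normalize` without defining them. One possible minimal version for a single environment is sketched below, assuming PyTorch and 1-D float tensors of length T. The `last_value` argument (the critic's estimate for the state after the final step) is an added assumption that the outline's call does not pass; it defaults to 0.

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized advantage estimation for one rollout of length T.

    rewards, values, dones: 1-D float tensors; dones[t] is 1.0 if the episode
    ended at step t, else 0.0. last_value bootstraps the final step when the
    rollout stops mid-episode (hypothetical argument, see lead-in).
    """
    values = values.detach()  # advantages are treated as constants in the loss
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        # TD error; the bootstrap term is dropped at episode boundaries
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        adv[t] = gae
    returns = adv + values  # regression targets for the critic
    return adv, returns

def normalize(x, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / (x.std() + eps)

# Tiny example: three steps, episode ends on the last one.
rewards = torch.tensor([1.0, 1.0, 1.0])
values = torch.tensor([0.5, 0.5, 0.5])
dones = torch.tensor([0.0, 0.0, 1.0])
adv, ret = compute_gae(rewards, values, dones)
adv_norm = normalize(adv)
```

With vectorized environments the same recursion runs per environment, typically on tensors of shape (T, n_envs).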

Hyperparameters (LunarLander-v2 starting point)

Param	Value	Notes
steps per rollout	2048	Per env; multiply by n_envs
PPO epochs	4–10	More epochs = more overfit risk
minibatch size	64–256
clip ε	0.2	Try 0.1 if unstable
γ	0.99
GAE λ	0.95
learning rate	3e-4	linear decay optional
entropy coef	0.01	Reduce if policy too random late

Stable-Baselines3 defaults work well for benchmarking.
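For reference, the table above maps onto Stable-Baselines3's `PPO` constructor roughly as follows. This is a sketch, not a tuned configuration: the names `SB3_PPO_KWARGS` and `train_lunar_lander` and the `total_timesteps` budget are assumptions, and the environment id may need a different version suffix depending on the installed Gymnasium release.

```python
# Keyword arguments for stable_baselines3.PPO mirroring the table above.
SB3_PPO_KWARGS = dict(
    n_steps=2048,        # steps per rollout, per env
    n_epochs=10,         # PPO epochs per rollout
    batch_size=64,       # minibatch size
    clip_range=0.2,      # clip epsilon
    gamma=0.99,
    gae_lambda=0.95,
    learning_rate=3e-4,
    ent_coef=0.01,       # note: SB3's own default is 0.0
)

def train_lunar_lander(total_timesteps=200_000, env_id="LunarLander-v2"):
    """Train PPO with SB3. Requires stable-baselines3 and gymnasium[box2d]."""
    # Imported lazily so the config above can be inspected without SB3 installed.
    from stable_baselines3 import PPO

    model = PPO("MlpPolicy", env_id, verbose=1, **SB3_PPO_KWARGS)
    model.learn(total_timesteps=total_timesteps)
    return model
```

Calling `train_lunar_lander()` starts a training run; SB3 prints clip fraction, approximate KL, and explained variance in its training log when `verbose=1`.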

Worked example — clip in action

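Take ε = 0.2, A_t = +2 (a good action), and a policy update that has pushed the ratio to r_t = 1.5.

text

unclipped  = 1.5 × 2 = 3.0
clipped r  = clip(1.5, 0.8, 1.2) = 1.2  →  1.2 × 2 = 2.4
objective  = min(3.0, 2.4) = 2.4  (the pessimistic value)

The 2.4 term no longer depends on θ, so this sample contributes zero gradient. The clip stops the update from further raising the probability of an action that is already more than 20% more likely under the new policy.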
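The same arithmetic can be swept over several ratios to show where the objective flattens. This is a plain-Python sketch; the helper names `clip` and `clipped_objective` and the chosen ratios are illustrative.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_objective(r, adv, eps=0.2):
    """Single-sample PPO objective min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)

ratios = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0]

# Good action (A = +2): the objective rises with r, then flattens at 1.2 * 2.
good = [clipped_objective(r, 2.0) for r in ratios]

# Bad action (A = -2): the objective is flat for r below 0.8 and keeps
# falling without limit as r grows, so raising r is always penalized.
bad = [clipped_objective(r, -2.0) for r in ratios]

for r, g, b in zip(ratios, good, bad):
    print(f"r={r:.1f}  A=+2: {g:+.2f}  A=-2: {b:+.2f}")
```

The flat regions are exactly where the gradient vanishes: above 1+ε for positive advantages and below 1−ε for negative ones.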

Monitoring

Metric	Healthy range
Clip fraction	0.1–0.3 typical
Approx KL	< 0.02 per update
Explained variance	Rising toward 1 as the critic fits returns
Episode return	Trending up

python

clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean()
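The other two metrics in the table can be computed from the same minibatch tensors. The sketch below assumes PyTorch; the function name `ppo_diagnostics` is hypothetical, and the KL estimate uses the common sample-based estimator E[(r − 1) − log r] rather than an exact KL between distributions.

```python
import math
import torch

def ppo_diagnostics(new_log_prob, old_log_prob, values, returns, clip_eps=0.2):
    """Return clip fraction, approximate KL, and explained variance as floats."""
    with torch.no_grad():
        log_ratio = new_log_prob - old_log_prob
        ratio = log_ratio.exp()
        clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean().item()
        # Non-negative estimate of KL(old || new) from old-policy samples
        approx_kl = ((ratio - 1.0) - log_ratio).mean().item()
        # 1 means the critic predicts returns perfectly; <= 0 means it is no
        # better than predicting the mean return
        var_ret = returns.var()
        if var_ret == 0:
            explained_var = float("nan")
        else:
            explained_var = (1.0 - (returns - values).var() / var_ret).item()
    return {"clip_frac": clip_frac, "approx_kl": approx_kl,
            "explained_var": explained_var}

# Made-up minibatch: two unchanged samples, one ratio of 1.5, one of 0.5.
old_lp = torch.zeros(4)
new_lp = torch.tensor([0.0, 0.0, math.log(1.5), math.log(0.5)])
returns = torch.tensor([1.0, 2.0, 3.0, 4.0])
stats = ppo_diagnostics(new_lp, old_lp, values=returns.clone(), returns=returns)
print(stats)
```

Logging these once per update, averaged over minibatches, is usually enough to spot an update that is moving the policy too far.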

Checkpoint: PPO reuses the same rollout for several epochs, which is why the clip matters. Without it, repeated passes over the same data would over-optimize the surrogate and push π far from the policy that collected the data. In short: collect once, learn many times, but clip the policy ratio so π does not run away.

Common mistakes

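Recomputing old_log_prob after a policy update: it must stay frozen at its collection-time value.
Too many epochs: clip fraction climbs toward 1.0 and KL explodes.
No advantage normalization: the scale of A_t then follows the reward scale, so the effective step size and the balance against the value and entropy terms drift between tasks and over training.
Single env, tiny rollouts: high-variance gradient estimates; use VectorEnv or parallel envs.
Wrong sign on entropy: the entropy bonus is subtracted from the loss, so that minimizing the loss maximizes entropy.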
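The first mistake is easy to reproduce on a toy problem. The sketch below assumes PyTorch and uses a state-independent categorical policy with three actions; the actions, advantages, and step size are made up for illustration. It takes one gradient step, then compares the ratio computed against frozen log-probs with the ratio computed against log-probs recomputed from the updated policy.

```python
import torch
from torch.distributions import Categorical

logits = torch.zeros(3, requires_grad=True)  # toy policy: uniform over 3 actions
actions = torch.tensor([0, 1, 2, 0])
advantage = torch.tensor([1.0, -1.0, 0.5, 1.0])

# Collection time: record log-probs once and detach them.
old_log_prob = Categorical(logits=logits).log_prob(actions).detach()

# One plain gradient step on the surrogate (stand-in for an optimizer update).
ratio = (Categorical(logits=logits).log_prob(actions) - old_log_prob).exp()
loss = -(ratio * advantage).mean()
loss.backward()
with torch.no_grad():
    logits -= 0.5 * logits.grad

new_log_prob = Categorical(logits=logits).log_prob(actions)

# Correct: compare the updated policy against the frozen log-probs.
ratio_frozen = (new_log_prob - old_log_prob).exp().detach()

# Mistake: "old" log-probs recomputed from the already-updated policy.
recomputed_old = Categorical(logits=logits).log_prob(actions).detach()
ratio_recomputed = (new_log_prob - recomputed_old).exp().detach()

print(ratio_frozen)      # differs from 1: the policy has moved
print(ratio_recomputed)  # all ones: the clip can never trigger
```

With recomputed log-probs the ratio is always 1, so the clip never binds and nothing limits how far the policy drifts across epochs.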

What's next

Next lesson — A2C & parallel RL