Module 6 — Actor–critic & PPO

TRPO intuition

Trust regions, KL constraints, and why naive policy gradients destabilize.

~60 min read + exercises

Before we begin

Trust Region Policy Optimization (TRPO) formalizes a simple idea: a single bad policy-gradient update can destroy performance, so each update should be limited in how far it moves the policy. TRPO enforces that limit with a trust region defined by the KL divergence between the old and new policies. PPO approximates the same idea with a clipped surrogate objective, so understanding TRPO explains why PPO's clip works.


Learning objectives

  • State why large policy updates cause catastrophic performance drops.
  • Define KL divergence between old and new policy qualitatively.
  • Read the TRPO constrained optimization problem at a high level.
  • Contrast the natural gradient (Fisher matrix) intuition with plain gradient ascent in parameter space (no full derivation required).
  • See PPO as a practical TRPO successor.

The problem — step size in policy space

Consider an actor–critic agent trained with a large learning rate: the return falls off a cliff, for example from 200 to 20 in one batch, because π shifted too far. The learning rate controls distance in weight space, but weight distance ≠ policy distance: two networks that are close in L2 norm can assign very different action probabilities.
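
The gap between weight distance and policy distance can be seen in a toy case. The sketch below assumes a hypothetical two-action linear softmax policy with one scalar state feature; the weights and the feature value are made up for illustration.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(weights, state):
    # One logit per action: logit_a = weights[a] * state (scalar feature).
    return softmax([w * state for w in weights])

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

w_old = [0.00, 0.00]
w_new = [0.05, -0.05]   # tiny move in weight space
state = 50.0            # hypothetical large-magnitude feature

weight_gap = l2_distance(w_old, w_new)   # about 0.07
probs_old = policy(w_old, state)         # uniform over the two actions
probs_new = policy(w_new, state)         # almost all mass on action 0
```

A weight change of roughly 0.07 in L2 norm moves the policy from a coin flip to near-deterministic, which is the kind of jump a bound on parameter step size alone does not rule out.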

Surrogate objective

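Define the probability ratio:

```text
r_t(θ) = π(a_t|s_t; θ) / π(a_t|s_t; θ_old)
```

Unclipped surrogate (maximize), where A_t is the advantage estimate at step t:

```text
L(θ) = E [ r_t(θ) · A_t ]
```

If r_t >> 1, the new policy puts far more probability on an action than the old policy did. The data was collected under the old policy, so the surrogate is now extrapolating well outside the region where its estimate is reliable, and the update over-emphasizes actions that were unlikely under π_old.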
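
As a minimal sketch of the two definitions above, the ratio and the surrogate can be computed from stored log-probabilities. The batch values are hypothetical, and the function names are illustrative rather than taken from any library.

```python
import math

def probability_ratios(logp_new, logp_old):
    # r_t = exp(log π_new - log π_old); log space avoids dividing tiny probabilities.
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def unclipped_surrogate(logp_new, logp_old, advantages):
    # Sample mean of r_t * A_t over the batch.
    ratios = probability_ratios(logp_new, logp_old)
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

# Hypothetical batch of three (state, action) samples.
logp_old = [math.log(0.5), math.log(0.2), math.log(0.3)]
logp_new = [math.log(0.5), math.log(0.8), math.log(0.15)]
advantages = [1.0, 5.0, -2.0]

ratios = probability_ratios(logp_new, logp_old)                   # about [1.0, 4.0, 0.5]
surrogate = unclipped_surrogate(logp_new, logp_old, advantages)   # about 6.67
```

The second sample, with a ratio near 4, contributes 20 of the 20-point batch total, so one large ratio can dominate the whole estimate.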

TRPO constraint

```text
maximize_θ  E [ r_t(θ) · A_t ]
subject to  E [ KL(π_old || π_θ) ] ≤ δ
```

A typical δ is 0.01: the KL divergence, averaged over visited states, must stay small. TRPO solves this approximately by running conjugate gradient on Fisher-vector products to find a step direction, followed by a line search, which is expensive. The result is approximately monotonic improvement in theory, at the cost of heavy compute in practice.
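
To make the conjugate gradient step concrete, here is a minimal sketch that solves F x = g using only Fisher-vector products, then scales the result so the quadratic KL approximation ½ sᵀ F s equals δ. The 2×2 matrix F and the gradient g are hypothetical stand-ins; a real implementation would compute `fvp` with automatic differentiation and follow the step with a line search.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v."""
    x = [0.0] * len(g)
    r = list(g)          # residual g - F x (x starts at zero)
    p = list(g)          # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Fp = fvp(p)
        alpha = rs_old / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Hypothetical Fisher matrix and policy gradient, for illustration only.
F = [[2.0, 0.5],
     [0.5, 1.0]]
g = [1.0, 1.0]
delta = 0.01

def fvp(v):
    # Stand-in for a Fisher-vector product; F is never inverted.
    return [dot(row, v) for row in F]

x = conjugate_gradient(fvp, g)                     # x approximates F⁻¹ g
step_scale = (2 * delta / dot(x, fvp(x))) ** 0.5   # largest scale with ½ sᵀ F s ≤ δ
step = [step_scale * xi for xi in x]
```

The point of the design is that `conjugate_gradient` only ever calls `fvp`, so the full Fisher matrix never has to be stored or inverted.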

Worked example — ratio blow-up

Old π(a|s) = 0.2, new π(a|s) = 0.8, advantage A = +5.

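```text
r = 0.8 / 0.2 = 4.0
contribution = 4.0 × 5 = 20
```

The optimizer sees a large positive gradient on this sample and pushes π(a|s) even higher. TRPO's KL cap bounds how far the policy can move in one update, so a jump this large is only possible if it fits within the δ budget.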
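
To see how far outside a typical trust region this example sits, the sketch below computes the KL divergence for this one state. It assumes the state has exactly two actions, which the example does not specify, so the other action's probability is 1 − π(a|s).

```python
import math

def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two discrete distributions over the same actions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0)

# Assumption: two actions only, so probabilities are [π(a|s), 1 - π(a|s)].
p_old = [0.2, 0.8]
p_new = [0.8, 0.2]
delta = 0.01

kl = categorical_kl(p_old, p_new)   # about 0.83
exceeds_budget = kl > delta
```

Under that assumption the per-state KL is roughly 80 times δ = 0.01. Because the constraint is on the average over states, a shift like this would have to be offset by near-zero change almost everywhere else.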

KL divergence intuition

| KL(π_old \|\| π_new) | Meaning |
|--------------------|--------|
| ~0 | Policies nearly identical |
| 0.01 | Typical TRPO trust-region size per update |
| > 0.1 | Often signals risk of performance collapse |

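```python
import torch
from torch.distributions import Categorical, kl_divergence

def approx_kl(old_dist: Categorical, new_dist: Categorical) -> torch.Tensor:
    """Mean KL(π_old || π_new) over a batch of states."""
    # kl_divergence is exact for two Categorical distributions and handles
    # zero probabilities, where old_p * (old_p.log() - new_p.log()) gives NaN.
    return kl_divergence(old_dist, new_dist).mean()
```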

Log the KL after each PPO epoch; if it spikes, reduce the learning rate.
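
When only the log-probabilities of the sampled actions are stored, rather than full distributions, the KL can still be estimated from samples. The sketch below uses the estimator mean((r − 1) − log r), which is non-negative per sample, together with a simple early-stopping check. The `target_kl` and `tolerance` values are hypothetical defaults, not settings from this lesson.

```python
import math

def sample_kl(logp_old, logp_new):
    """Estimate KL(π_old || π_new) from actions sampled under π_old."""
    total = 0.0
    for lo, ln in zip(logp_old, logp_new):
        log_ratio = ln - lo
        # (r - 1) - log r is >= 0 for every sample and has expectation KL.
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(logp_old)

def should_stop_early(kl, target_kl=0.01, tolerance=1.5):
    # Hypothetical rule: stop the current round of epochs once KL
    # exceeds tolerance * target_kl.
    return kl > tolerance * target_kl

# Hypothetical log-probabilities for three sampled actions.
logp_old = [math.log(0.5), math.log(0.2), math.log(0.3)]
logp_new = [math.log(0.5), math.log(0.8), math.log(0.15)]

kl_estimate = sample_kl(logp_old, logp_new)
stop = should_stop_early(kl_estimate)
```

With only three samples the estimate is noisy; in practice it would be averaged over a full minibatch before being compared with the threshold.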

Natural gradient (one paragraph)

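Euclidean gradient ascent in θ is not steepest ascent in distribution space: the same step size in θ can change π a little or a lot depending on where θ sits. The natural gradient corrects for this by preconditioning the gradient g with the inverse Fisher information matrix, giving F⁻¹g. TRPO approximates this trust-region step without forming the full F each time, using Fisher-vector products instead. PPO drops the explicit KL constraint in favor of a clip, which gives simpler code and similar empirical results.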
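
A one-parameter case makes the preconditioning visible. The sketch below assumes a hypothetical Bernoulli policy with p = sigmoid(θ) and a two-outcome reward, where the Fisher information with respect to θ is p(1 − p). It is an illustration of the idea, not TRPO itself.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def euclidean_gradient(theta, reward_gap):
    # J(θ) = p*R1 + (1-p)*R0 with p = sigmoid(θ), so dJ/dθ = p(1-p)(R1 - R0).
    p = sigmoid(theta)
    return p * (1.0 - p) * reward_gap

def fisher_information(theta):
    # Fisher information of Bernoulli(sigmoid(θ)) with respect to θ.
    p = sigmoid(theta)
    return p * (1.0 - p)

def natural_gradient(theta, reward_gap):
    # F⁻¹ g in one dimension is a plain division.
    return euclidean_gradient(theta, reward_gap) / fisher_information(theta)

reward_gap = 1.0  # hypothetical R1 - R0
for theta in (0.0, 4.0):
    print(theta, euclidean_gradient(theta, reward_gap), natural_gradient(theta, reward_gap))
```

The Euclidean gradient shrinks as the policy saturates (θ = 4), while the natural gradient stays equal to the reward gap, because dividing by F removes the dependence on how the distribution happens to be parameterized.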

TRPO vs PPO (preview)

| Aspect | TRPO | PPO |
|--------|------|-----|
| Constraint | Hard KL | Clip r to [1−ε, 1+ε] |
| Implementation | CG + line search | Multiple SGD epochs on same batch |
| Adoption | Research reference | Industry default |
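
As a preview of the PPO side of the comparison, the sketch below applies the clipped surrogate min(r·A, clip(r, 1−ε, 1+ε)·A) to the worked example (r = 4, A = +5). The value ε = 0.2 is an assumed setting, not one given in this lesson.

```python
def clipped_surrogate_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

unclipped = 4.0 * 5.0                              # 20, as in the worked example
clipped = clipped_surrogate_term(4.0, 5.0)         # about 6: capped at (1 + eps) * A
pessimistic = clipped_surrogate_term(4.0, -5.0)    # -20: the min keeps the worse value
```

Once the ratio passes 1 + ε with a positive advantage, the objective is flat in r, so the gradient gives no further incentive to push the probability higher. With a negative advantage the min keeps the unclipped term, so the penalty for a bad move is not softened.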

Checkpoint: Limit policy change per update; KL measures change in action distributions, not weights. TRPO answers "how big a policy step is safe?" with an explicit KL constraint; PPO answers with a one-line clip in code.

Common mistakes

  1. Treating TRPO and PPO as unrelated: PPO is a deliberate simplification of the same surrogate.
  2. Running many PPO epochs without the clip: this reproduces pre-TRPO instability.
  3. Ignoring KL monitoring: even PPO benefits from early stopping on KL.
  4. Ignoring advantage scale relative to δ: badly scaled advantages interact poorly with the trust region.
  5. Assuming TRPO's guarantees hold in deep nets: the theory's assumptions weaken with function approximation.
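
For mistake 4, one common remedy is to normalize advantages within each batch before computing the surrogate. The sketch below is a minimal version of that idea; the lesson itself does not prescribe it, and the raw values are hypothetical.

```python
import math

def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and unit standard deviation."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]

raw = [250.0, -100.0, 40.0, -190.0]   # hypothetical, badly scaled advantages
normalized = normalize_advantages(raw)
```

After normalization the surrogate's magnitude no longer depends on the reward scale of the environment, so the same δ or clip range behaves more consistently across tasks.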
