TRPO intuition
Before we begin
Trust Region Policy Optimization (TRPO) addresses a simple problem: a single bad policy-gradient update can destroy performance. TRPO limits how much π can change per step by enforcing a KL-divergence trust region. PPO approximates this with a clipped surrogate, so understanding TRPO explains why PPO's clip works.
Learning objectives
- State why large policy updates cause catastrophic performance drops.
- Define the KL divergence between the old and new policies qualitatively.
- Read the TRPO constrained optimization problem at a high level.
- Contrast natural-gradient (Fisher matrix) updates with ordinary gradient ascent at an intuitive level (no full derivation required).
- See PPO as a practical TRPO successor.
The problem — step size in policy space
Consider an actor–critic agent trained with a large learning rate: the return can drop from 200 to 20 in a single batch because the update moved π too far. Shrinking the step in weight space is not a reliable fix, because weight distance ≠ policy distance. Two networks that are close in L2 norm can assign very different action probabilities.
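The gap between weight distance and policy distance can be seen in a tiny example. The sketch below assumes a hypothetical two-action linear softmax policy (one weight per action, logit = weight × feature) and made-up numbers; none of these names come from the lesson itself.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(weights, feature):
    """Hypothetical linear policy: one weight per action, logit = weight * feature."""
    return softmax([w * feature for w in weights])

def l2_distance(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative numbers only: a large-magnitude state feature amplifies tiny weight changes.
feature = 50.0
w_old = [0.00, 0.00]
w_new = [0.02, -0.02]

weight_gap = l2_distance(w_old, w_new)   # about 0.028
p_old = policy_probs(w_old, feature)     # [0.5, 0.5]
p_new = policy_probs(w_new, feature)     # roughly [0.88, 0.12]

print(f"L2 weight distance: {weight_gap:.3f}")
print(f"old policy: {p_old}")
print(f"new policy: {p_new}")
```

A weight change of under 0.03 in L2 norm moves the first action's probability from 0.5 to about 0.88, which is why TRPO measures the step in distribution space instead.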
Surrogate objective
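Define the probability ratio:
r_t(θ) = π(a_t|s_t; θ) / π(a_t|s_t; θ_old)
Unclipped surrogate (maximize):
L(θ) = E [ r_t(θ) · A_t ]
If r_t >> 1, the update leans heavily on actions that were unlikely under the old policy, which collected the data. The surrogate is only trustworthy near θ_old, so optimizing it far from there is dangerous off-policy extrapolation.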
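As a rough illustration of the surrogate, the sketch below computes the ratios and the batch mean of r_t · A_t from log-probabilities. The function name and the two-transition batch are hypothetical; working in log space is a common convention, not something the lesson prescribes.

```python
import math

def surrogate_objective(new_logp, old_logp, advantages):
    """Return (mean of r_t * A_t, list of ratios), with r_t = exp(log pi_new - log pi_old)."""
    ratios = [math.exp(n - o) for n, o in zip(new_logp, old_logp)]
    terms = [r * a for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms), ratios

# Made-up batch of two transitions.
old_logp = [math.log(0.5), math.log(0.25)]
new_logp = [math.log(0.5), math.log(0.5)]
advantages = [1.0, 2.0]

value, ratios = surrogate_objective(new_logp, old_logp, advantages)
print(f"ratios: {ratios}")
print(f"surrogate: {value}")
```

Here the second action doubled in probability, so its advantage counts twice as much: the ratios are about 1 and 2, and the surrogate is about (1·1 + 2·2) / 2 = 2.5.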
TRPO constraint
maximize_θ E [ r_t(θ) · A_t ]
subject to E [ KL(π_old || π_θ) ] ≤ δ
δ might be 0.01: the KL divergence averaged over states must stay small. TRPO solves this approximately with conjugate gradient on Fisher-vector products followed by a line search, which is expensive. The result is an approximately monotonic improvement guarantee in theory and heavy compute in practice.
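To make "conjugate gradient on Fisher-vector products" less abstract, here is a minimal sketch on a toy problem. It assumes the usual quadratic approximation KL ≈ ½ sᵀ F s, under which the step is scaled by sqrt(2δ / sᵀ F s). The 2×2 matrix standing in for the Fisher matrix, the gradient, and all function names are illustrative assumptions; a real implementation would compute Fisher-vector products by automatic differentiation and add a line search.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v."""
    x = [0.0] * len(g)
    r = list(g)   # residual g - F x (x starts at zero)
    p = list(g)   # current search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Fp = fvp(p)
        alpha = rs / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def trust_region_step(fvp, g, delta=0.01):
    """Scale the direction F^-1 g so the quadratic KL estimate equals delta."""
    direction = conjugate_gradient(fvp, g)
    quad = dot(direction, fvp(direction))      # s^T F s
    scale = math.sqrt(2.0 * delta / quad)
    return [scale * d for d in direction]

# Toy stand-in for the Fisher matrix and policy gradient (illustrative numbers only).
F = [[2.0, 0.0], [0.0, 0.5]]

def toy_fvp(v):
    return [dot(row, v) for row in F]

g = [1.0, 1.0]
direction = conjugate_gradient(toy_fvp, g)
step = trust_region_step(toy_fvp, g)
predicted_kl = 0.5 * dot(step, toy_fvp(step))
print(f"direction: {direction}, step: {step}, predicted KL: {predicted_kl}")
```

The solver never forms or inverts F; it only asks for products F v, which is what keeps the approach feasible for networks with millions of parameters.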
Worked example — ratio blow-up
Old π(a|s) = 0.2, new π(a|s) = 0.8, advantage A = +5.
r = 0.8 / 0.2 = 4.0
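contribution = 4.0 × 5 = 20
The optimizer sees a large positive surrogate term and pushes π(a|s) even higher. TRPO's KL cap limits how far the policy can move in a single update, however large the advantage estimate; a shift this big has to be earned over several updates.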
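It can help to see how far outside the trust region this jump would be. The sketch below assumes the state has only two actions, so the other action's probability moves from 0.8 to 0.2; that assumption is not in the example above, and the helper name is hypothetical.

```python
import math

def kl_categorical(p, q):
    """KL(p || q) for discrete distributions given as probability lists (assumes q > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Assumption: only two actions, so the second action moves from 0.8 to 0.2.
pi_old = [0.2, 0.8]
pi_new = [0.8, 0.2]
advantage = 5.0
delta = 0.01  # example trust-region size quoted with the TRPO constraint

ratio = pi_new[0] / pi_old[0]
contribution = ratio * advantage
kl = kl_categorical(pi_old, pi_new)

print(f"ratio = {ratio:.1f}, contribution = {contribution:.1f}")
print(f"KL(old || new) = {kl:.3f}, which is {kl / delta:.0f}x delta")
```

Under that assumption the KL for this state is about 0.83, roughly 83 times a δ of 0.01.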
KL divergence intuition
| KL(π_old \|\| π_new) | Meaning |
|---|---|
| ~0 | Policies nearly identical |
| 0.01 | Typical TRPO trust-region size per update |
| > 0.1 | Often a sign of performance-collapse risk |
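import torch
from torch.distributions import Categorical, kl_divergence

def mean_kl(old_dist: Categorical, new_dist: Categorical) -> torch.Tensor:
    """Batch mean of KL(old || new); exact (not approximate) for categorical policies."""
    # kl_divergence handles zero-probability actions without producing NaNs,
    # unlike a hand-written old_p * (old_p.log() - new_p.log()).
    return kl_divergence(old_dist, new_dist).mean()
Log the KL after each PPO epoch. If it spikes, reduce the learning rate.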
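When only the log-probabilities of the sampled actions are stored, not the full distributions, the KL can still be estimated from samples. The sketch below uses the estimator mean((r − 1) − log r) with r = π_new / π_old, which is non-negative term by term. The function names, the target of 0.01, and the 1.5× slack factor are illustrative assumptions, not values from this lesson.

```python
import math

def sampled_kl(old_logp, new_logp):
    """Estimate KL(old || new) from actions sampled under the old policy.

    Averages (r - 1) - log r, where r = pi_new / pi_old for each sampled action.
    """
    total = 0.0
    for o, n in zip(old_logp, new_logp):
        log_ratio = n - o
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(old_logp)

def should_stop_early(kl, target_kl=0.01, slack=1.5):
    """Hypothetical rule: stop the epoch loop once KL exceeds slack * target_kl."""
    return kl > slack * target_kl

# Made-up log-probabilities of three sampled actions, before and after an update.
old_logp = [math.log(0.30), math.log(0.50), math.log(0.20)]
new_logp = [math.log(0.33), math.log(0.48), math.log(0.22)]

kl_estimate = sampled_kl(old_logp, new_logp)
print(f"estimated KL: {kl_estimate:.5f}, stop early: {should_stop_early(kl_estimate)}")
```

Calling `should_stop_early` after each epoch gives a cheap guard against the policy drifting too far from the one that collected the batch.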
Natural gradient (one paragraph)
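Euclidean gradient ascent in θ is not steepest ascent in distribution space. The natural gradient corrects for this by preconditioning the gradient g with the inverse Fisher information matrix, giving F⁻¹ g. TRPO approximates this trust-region step with conjugate gradient, using Fisher-vector products so that the full F is never formed or inverted. PPO drops the explicit KL constraint in favor of a clip, which gives simpler code and similar empirical results.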
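A one-parameter example shows what the F⁻¹ preconditioning does. The sketch assumes a hypothetical two-action policy with π(a=1) = sigmoid(θ) and fixed rewards per action; for this parameterization the Fisher information is p(1 − p). All names and numbers are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradients(theta, reward_a1=1.0, reward_a0=0.0):
    """Vanilla and natural gradient of expected reward for a one-parameter policy.

    Policy: pi(a=1) = sigmoid(theta). Expected reward J = p * R1 + (1 - p) * R0.
    """
    p = sigmoid(theta)
    vanilla = p * (1.0 - p) * (reward_a1 - reward_a0)  # dJ/dtheta
    fisher = p * (1.0 - p)                             # Fisher information of a Bernoulli in its logit
    natural = vanilla / fisher                         # F^-1 g
    return vanilla, natural

for theta in (0.0, 4.0):
    vanilla, natural = gradients(theta)
    print(f"theta={theta}: vanilla={vanilla:.4f}, natural={natural:.4f}")
```

The vanilla gradient shrinks as the policy saturates (θ = 4), while the natural gradient stays at R1 − R0 = 1: it measures improvement per unit of change in the distribution, not per unit of change in θ.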
TRPO vs PPO (preview)
| Aspect | TRPO | PPO |
|---|---|---|
| Constraint | Hard KL | Clip r to [1−ε, 1+ε] |
| Implementation | CG + line search | Multiple SGD epochs on same batch |
| Adoption | Research reference | Industry default |
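The clip in the PPO column can be sketched in a few lines. The code below applies the standard clipped term min(r·A, clip(r, 1−ε, 1+ε)·A) to the numbers from the ratio blow-up example; ε = 0.2 is an assumed value, and the function name is hypothetical.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Numbers from the ratio blow-up example: r = 4.0, A = +5.
unclipped = 4.0 * 5.0
clipped = ppo_clip_term(4.0, 5.0)
print(f"unclipped term: {unclipped}, clipped term: {clipped}")

# With a negative advantage the min keeps the more pessimistic value.
print(ppo_clip_term(0.5, -5.0))
```

Once r leaves [1−ε, 1+ε] in the direction the advantage favors, the term stops growing, so the gradient through it vanishes and the sample no longer pushes the policy further.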
Checkpoint: TRPO answers "how big a policy step is safe?" with an explicit KL constraint; PPO answers with a one-line clip in code.
Summary: Limit the policy change per update; KL measures change in action distributions, not in weights.
Common mistakes
- Treating TRPO and PPO as unrelated: PPO is a deliberate simplification of the same surrogate.
- Running many PPO epochs on the same batch without the clip: this reproduces pre-TRPO instability.
- Ignoring KL monitoring: even PPO benefits from early stopping on KL.
- Overlooking advantage scale relative to δ: badly scaled advantages interact poorly with the trust region.
- Assuming TRPO's guarantees carry over to deep networks: the theory's assumptions weaken under function approximation.
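For the advantage-scale point, one common remedy (an assumption here, not something this lesson prescribes) is to standardize advantages within each batch before forming the surrogate. A minimal sketch with a hypothetical helper:

```python
import statistics

def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation within the batch."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]

# Made-up advantages on a large reward scale.
raw = [100.0, 200.0, 300.0]
normalized = normalize_advantages(raw)
print(normalized)
```

After standardizing, the size of the surrogate's gradient depends less on the environment's reward scale, which makes a fixed δ or clip range easier to reuse across tasks.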