TRPO intuition

Before we begin

Trust Region Policy Optimization (TRPO) starts from a simple observation: a single bad policy-gradient update can destroy performance. TRPO therefore constrains how much π can change per step using a KL-divergence trust region. PPO approximates this with a clipped surrogate, so understanding TRPO explains why PPO's clip works.

Learning objectives

State why large policy updates cause catastrophic performance drops.
Define KL divergence between old and new policy qualitatively.
Read the TRPO constrained optimization problem at a high level.
Contrast the natural gradient (Fisher matrix intuition) with the ordinary Euclidean gradient (no full derivation required).
See PPO as a practical TRPO successor.

The problem — step size in policy space

Consider an actor-critic agent trained with a large learning rate: the return can drop from 200 to 20 in a single batch because the update moved π too far. The size of the step in weight space is a poor guide here, because weight distance ≠ policy distance. Two networks that are close in L2 norm can assign very different action probabilities.
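To make the "weight distance ≠ policy distance" point concrete, here is a minimal sketch. The linear softmax policy, the weight values, and the feature magnitude are all made up for illustration; they are not taken from any particular agent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(weights, feature):
    """Hypothetical linear policy: one logit per action, logit = weight * feature."""
    return softmax([w * feature for w in weights])

w_old = [0.00, 0.00]
w_new = [0.01, -0.01]   # tiny change in weight space
feature = 200.0         # assumed large-magnitude state feature

l2_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_old, w_new)))
p_old = policy(w_old, feature)
p_new = policy(w_new, feature)

print(f"L2 weight distance: {l2_distance:.4f}")
print("old action probs:", [round(p, 3) for p in p_old])
print("new action probs:", [round(p, 3) for p in p_new])
```

The weights move by about 0.014 in L2 norm, yet the policy goes from uniform to almost deterministic, because the large feature amplifies the small weight change into a logit gap of 4.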

Surrogate objective

Define the probability ratio between the new and old policy:

text

r_t(θ) = π(a_t|s_t; θ) / π(a_t|s_t; θ_old)

Unclipped surrogate (maximize):

text

L(θ) = E [ r_t(θ) · A_t ]
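In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities. The sketch below estimates the surrogate as a batch mean; the three samples and the function name `surrogate` are invented for illustration.

```python
import math

def surrogate(new_logps, old_logps, advantages):
    """Batch mean of r_t * A_t, with r_t = exp(log pi_new - log pi_old)."""
    ratios = [math.exp(n - o) for n, o in zip(new_logps, old_logps)]
    terms = [r * a for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms), ratios

# Made-up batch of three (state, action) samples.
old_logps = [math.log(0.5), math.log(0.2), math.log(0.4)]
new_logps = [math.log(0.5), math.log(0.3), math.log(0.2)]
advantages = [1.0, 2.0, -1.0]

value, ratios = surrogate(new_logps, old_logps, advantages)
print("ratios:", [round(r, 3) for r in ratios])
print("surrogate estimate:", round(value, 4))
```

When θ = θ_old every ratio equals 1, so the surrogate reduces to the mean advantage; the ratios only start to matter once the new policy moves away from the old one.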

If r_t ≫ 1, the update over-weights actions that were unlikely under the old policy. The advantages were estimated from data collected under π_old, so this amounts to dangerous off-policy extrapolation.

TRPO constraint

text

maximize_θ  E [ r_t(θ) · A_t ]
subject to  E [ KL(π_old || π_θ) ] ≤ δ

A typical δ is 0.01: the KL divergence, averaged over visited states, must stay small. TRPO solves this problem approximately with conjugate gradient on Fisher-vector products followed by a line search, which is expensive. The result is approximately monotonic improvement in theory and heavy compute in practice.
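The conjugate-gradient step can be sketched on a toy problem. This is an illustration under stated assumptions, not a TRPO implementation: the 2×2 matrix `F` and vector `g` are made-up stand-ins for the Fisher matrix and policy gradient, and a real implementation would compute the product `fvp(v)` by automatic differentiation without ever building `F`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v."""
    x = [0.0] * len(g)
    r = list(g)          # residual g - F x (x starts at zero)
    p = list(g)          # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Fp = fvp(p)
        alpha = rr / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

# Toy stand-ins: a fixed symmetric positive-definite "Fisher" matrix and a gradient.
F = [[2.0, 0.5], [0.5, 1.0]]
g = [1.0, 1.0]
fvp = lambda v: [dot(row, v) for row in F]

delta = 0.01
direction = conjugate_gradient(fvp, g)       # approximates F^-1 g
# Scale so the quadratic KL estimate 0.5 * step^T F step equals delta.
scale = math.sqrt(2 * delta / dot(direction, fvp(direction)))
step = [scale * d for d in direction]
print("direction:", [round(d, 4) for d in direction])
print("step:", [round(s, 4) for s in step])
```

The scaling relies on the second-order approximation KL ≈ ½ ΔθᵀFΔθ, which is why a line search on the true KL and surrogate normally follows: the quadratic estimate can be off for a real network.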

Worked example — ratio blow-up

Old π(a|s) = 0.2, new π(a|s) = 0.8, advantage A = +5.

text

r = 0.8 / 0.2 = 4.0
contribution = 4.0 × 5 = 20

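The optimizer sees a large positive gradient and pushes π toward an even more extreme distribution. TRPO's KL cap bounds how far one update can move the policy, so a shift this large has to be earned over several updates, each backed by fresh advantage estimates.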
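It helps to put a number on how far this jump sits outside a trust region. The sketch below assumes a two-action policy, so the other action receives the remaining probability mass; that assumption is not part of the example above. It also looks at one state only, whereas the TRPO constraint is an average over states.

```python
import math

def kl_categorical(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumption: two actions, so the second action gets 1 - pi(a|s).
pi_old = [0.2, 0.8]
pi_new = [0.8, 0.2]

kl = kl_categorical(pi_old, pi_new)
delta = 0.01
print(f"KL(old || new) = {kl:.3f}, about {kl / delta:.0f}x a budget of delta = {delta}")
```

Under these assumptions the KL at this state is roughly 0.83, far above a budget of 0.01, so a trust-region method would accept only a small fraction of this move in one update.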

KL divergence intuition

| KL(π_old ‖ π_new) | Meaning |
|-------------------|---------|
| ~0 | Policies nearly identical |
| 0.01 | Typical TRPO trust-region size (δ) per update |
| > 0.1 | Often a risk of performance collapse |

python

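import torch
from torch.distributions import Categorical, kl_divergence

def approx_kl(old_dist: Categorical, new_dist: Categorical) -> torch.Tensor:
    """Mean KL(old || new) over a batch of states (exact for Categorical)."""
    # kl_divergence returns one value per state and handles zero
    # probabilities, which a hand-written p * (log p - log q) turns into NaN.
    # Build old_dist from detached tensors so no gradient reaches the old policy.
    return kl_divergence(old_dist, new_dist).mean()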

Log the KL after each PPO epoch. If it spikes, reduce the learning rate or stop iterating on the current batch.
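When only the log-probabilities of the sampled actions are stored, a sample-based estimate is a common substitute for the full-distribution KL. The sketch below is one way to turn that estimate into an early-stopping check; the function names, the target of 0.01, and the 1.5 tolerance factor are assumptions chosen for illustration.

```python
def sample_kl(old_logps, new_logps):
    """Estimate KL(old || new) from actions sampled under the old policy."""
    diffs = [o - n for o, n in zip(old_logps, new_logps)]
    return sum(diffs) / len(diffs)

def should_stop_early(old_logps, new_logps, target_kl=0.01, tolerance=1.5):
    """True when the estimated KL exceeds tolerance * target_kl (assumed thresholds)."""
    return sample_kl(old_logps, new_logps) > tolerance * target_kl

# Made-up log-probabilities of four sampled actions.
old_logps = [-1.20, -0.70, -2.00, -0.40]
new_logps_small = [-1.19, -0.71, -2.01, -0.40]   # policy barely moved
new_logps_large = [-1.60, -0.90, -2.50, -0.30]   # policy moved a lot

print(should_stop_early(old_logps, new_logps_small))
print(should_stop_early(old_logps, new_logps_large))
```

This estimator is noisy and can even come out negative on small batches, so it works better as a tripwire than as a precise measurement.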

Natural gradient (one paragraph)

Euclidean gradient ascent in θ is not steepest ascent in distribution space. The natural gradient preconditions the gradient g with the inverse Fisher information matrix, giving the direction F⁻¹g. TRPO approximates this trust-region step without ever forming the full F, using only Fisher-vector products. PPO drops the explicit KL constraint in favor of a clip, which gives simpler code and similar empirical results.
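The link between the Fisher matrix and KL is that, for small steps, KL ≈ ½ ΔθᵀFΔθ. A minimal numerical check, assuming a hypothetical two-action policy parameterized by a single logit (for which the Fisher information is p(1 − p)):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta = 0.0                      # logit of the hypothetical two-action policy
step = 0.1                       # small change in the logit
p_old = sigmoid(theta)
p_new = sigmoid(theta + step)

fisher = p_old * (1 - p_old)     # Fisher information of a Bernoulli w.r.t. its logit
exact_kl = kl_bernoulli(p_old, p_new)
quadratic_kl = 0.5 * fisher * step ** 2

print(f"exact KL:     {exact_kl:.7f}")
print(f"quadratic KL: {quadratic_kl:.7f}")
```

The two values agree closely for a step this small, which is why a KL budget translates into a bound on ΔθᵀFΔθ. The agreement degrades as the step grows.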

TRPO vs PPO (preview)

| Aspect | TRPO | PPO |
|--------|------|-----|
| Constraint | Hard KL | Clip r to [1−ε, 1+ε] |
| Implementation | CG + line search | Multiple SGD epochs on same batch |
| Adoption | Research reference | Industry default |
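As a preview of the clip, here is a minimal per-sample sketch applied to the ratio blow-up example (r = 4.0, A = +5). It uses the commonly cited form min(r·A, clip(r, 1−ε, 1+ε)·A); the value ε = 0.2 and the function name are assumptions for illustration.

```python
def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

unclipped = 4.0 * 5.0                  # raw surrogate contribution from the example
clipped = clipped_term(4.0, 5.0)       # capped at (1 + eps) * A

print("unclipped:", unclipped)
print("clipped:  ", round(clipped, 6))
```

Once the ratio passes 1 + ε with a positive advantage, the clipped term stops growing, so the sample no longer rewards pushing the policy further in that direction.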

Checkpoint: TRPO answers "how big a policy step is safe?" PPO answers with a one-line clip in code.

Summary: Limit policy change per update; KL measures change in action distributions, not weights.

Common mistakes

Treating TRPO and PPO as unrelated: PPO is a deliberate simplification of the same surrogate objective.
Running many PPO epochs on one batch without the clip: this reproduces pre-TRPO instability.
Ignoring KL monitoring: even PPO benefits from early stopping on KL.
Mismatching advantage scale and δ: badly scaled advantages interact poorly with the trust region.
Assuming TRPO's guarantees hold in deep nets: the theory's assumptions weaken under function approximation.

What's next

Next lesson — Proximal policy optimization