Module 5 — Policy gradients

Actor–critic architecture

Two networks: policy actor and value critic; TD bootstrapping for critics.

~65 min read + exercises

Before we begin

An actor–critic pairs a policy actor π(a|s; θ) with a value critic V(s; w) or Q(s,a; w). The critic supplies learning signals with lower variance than Monte Carlo returns; the actor uses them to improve the policy. This template underlies A2C, PPO, SAC, and many other widely used deep RL algorithms.


Learning objectives

  • Draw the actor–critic data flow: env → actor → action → critic evaluates.
  • Implement one-step actor–critic (TD advantage) on CartPole.
  • Distinguish V-critic vs Q-critic actor–critic variants.
  • Tune separate learning rates for actor and critic.
  • Recognize that actor–critic is on-policy unless importance sampling is added.

Architecture diagram (conceptual)

text
        ┌──────────┐
  s ──► │  Actor   │ ──► a ~ π(·|s)
        └──────────┘


        environment ──► r, s'


        ┌──────────┐
  s ──► │  Critic  │ ──► V(s) or Q(s,a)
        └──────────┘

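A shared trunk is optional: common layers (for example a CNN on pixel inputs) feed separate heads for policy and value. The layout resembles dueling DQN, but the purpose differs: dueling DQN recombines its value and advantage heads into Q-values, whereas here the policy and value heads stay separate outputs.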
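The shared-trunk layout can be sketched in PyTorch as follows. This is a minimal illustration, not a reference implementation: the class name `SharedActorCritic`, the hidden size, and the use of a small MLP trunk instead of a CNN are all assumptions, chosen to match CartPole-sized inputs (4 observation values, 2 discrete actions).

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class SharedActorCritic(nn.Module):
    """Hypothetical shared-trunk network: one feature extractor, two heads."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Trunk shared by both heads (an MLP here; a CNN would go here for pixels)
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)           # scalar V(s)

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        dist = Categorical(logits=self.policy_head(features))
        value = self.value_head(features).squeeze(-1)
        return dist, value


net = SharedActorCritic(obs_dim=4, n_actions=2)
dist, value = net(torch.zeros(4))   # one dummy observation
action = dist.sample()              # a ~ pi(.|s)
```

Because both heads backpropagate into `trunk`, the actor and critic losses are usually summed with a weighting coefficient when this layout is used; that weighting is a tuning choice and is not shown here.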

One-step TD advantage

text
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
θ ← θ + α ∇ log π(a_t|s_t) · δ_t
w ← w + β δ_t ∇ V(s_t)

δ_t is a one-step advantage estimate: it has lower variance than the Monte Carlo return G_t, at the cost of some bias from bootstrapping on an imperfect V.

Full step — PyTorch

python
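import torch
import torch.nn as nn

gamma = 0.99

# Assumed to exist: actor (returns a torch.distributions object), critic
# (returns V(s)), actor_opt, critic_opt, and the current observation obs.
obs_t = torch.as_tensor(obs, dtype=torch.float32)

# actor: sample an action and keep its log-probability
dist = actor(obs_t)
action = dist.sample()
log_prob = dist.log_prob(action)

# Step the environment with the sampled action here to obtain
# reward, obs2, and done for this transition.
obs2_t = torch.as_tensor(obs2, dtype=torch.float32)

# critic: V(s) carries gradients, V(s') is a fixed bootstrap target
v = critic(obs_t).squeeze(-1)
with torch.no_grad():
    v2 = critic(obs2_t).squeeze(-1)

# (1 - done) zeroes the bootstrap term at terminal states
td_target = reward + gamma * (1.0 - float(done)) * v2
delta = td_target - v

# detach delta so the actor loss does not backpropagate into the critic
actor_loss = -log_prob * delta.detach()
critic_loss = delta.pow(2)

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()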

Shared vs separate networks

Design | Pros | Cons
Separate actor/critic | Simple, stable on small envs | More parameters
Shared trunk + two heads | Sample efficient on pixels | Critic gradients may harm features
Q-critic (SAC, DDPG) | Works with continuous actions | More complex; trained off-policy

CartPole project: separate 2-layer MLPs are fine.
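A separate-network setup of the kind suggested for CartPole might look like the sketch below. The names `actor_net`, `critic_net`, and `update`, the hidden size, and both learning rates are illustrative assumptions, and the transition fed in at the end is synthetic rather than taken from a real environment.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)
obs_dim, n_actions, hidden, gamma = 4, 2, 64, 0.99

# Two independent 2-layer MLPs: no parameters are shared
actor_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions))
critic_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

# Separate optimizers so each network gets its own learning rate (values are guesses)
actor_opt = torch.optim.Adam(actor_net.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic_net.parameters(), lr=5e-3)


def update(obs, action, reward, obs2, done):
    """One-step actor-critic update on a single transition; returns the TD error."""
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    obs2_t = torch.as_tensor(obs2, dtype=torch.float32)

    v = critic_net(obs_t).squeeze(-1)
    with torch.no_grad():                       # bootstrap target carries no gradient
        v2 = critic_net(obs2_t).squeeze(-1)
    delta = reward + gamma * (1.0 - float(done)) * v2 - v

    log_prob = Categorical(logits=actor_net(obs_t)).log_prob(torch.as_tensor(action))
    actor_loss = -log_prob * delta.detach()     # policy-gradient term
    critic_loss = delta.pow(2)                  # squared TD error

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return delta.item()


# Synthetic transition, for illustration only
td_error = update(obs=[0.0, 0.1, 0.0, -0.1], action=0, reward=1.0,
                  obs2=[0.0, 0.1, 0.01, -0.1], done=False)
print(td_error)
```

In a real training loop, `update` would be called once per environment step with the action that the actor actually sampled for that observation.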

Worked example — TD delta

γ=0.99, r=+1, V(s)=10, V(s′)=12, done=False.

text
δ = 1 + 0.99 × 12 − 10 = 2.88

Positive δ: the action turned out better than the critic expected, so the update increases log π(a|s). If V(s′) were 8 instead, δ = 1 + 0.99 × 8 − 10 = −1.08, and the update would decrease log π(a|s).
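The arithmetic in this worked example can be checked with a few lines of plain Python. The helper name `td_delta` is hypothetical; it simply evaluates δ = r + γ V(s′) − V(s), dropping the bootstrap term when the episode has ended.

```python
def td_delta(reward: float, v_s: float, v_next: float,
             gamma: float = 0.99, done: bool = False) -> float:
    """One-step TD error; the bootstrap term is dropped at terminal states."""
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s


print(round(td_delta(1.0, 10.0, 12.0), 2))             # 2.88
print(round(td_delta(1.0, 10.0, 8.0), 2))              # -1.08
print(round(td_delta(1.0, 10.0, 12.0, done=True), 2))  # -9.0
```

The third call shows how much the terminal mask matters: with the same values but done=True, the estimate swings from +2.88 to −9.0.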

On-policy constraint

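Standard actor–critic assumes its data comes from the current π. When π changes mid-batch, advantages computed under the old policy become stale. PPO (Module 6) addresses this with a clipped objective that limits how far the updated policy can move from the one that collected the batch. Combining DQN-style replay with policy gradients needs importance sampling corrections (advanced).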

Checkpoint

Summary: Two networks, one job each. The actor chooses; the critic judges surprise.

Details: Actor–critic = REINFORCE + a learned baseline that updates every step via TD.

Common mistakes

  1. Letting the actor loss update the critic: call .detach() on δ before it multiplies log π in the actor loss.
  2. Same LR for actor and critic: the critic often needs a higher LR or more updates per step.
  3. Updating the critic only at episode end: this wastes the TD structure; update each step.
  4. Entropy collapse: add an entropy bonus, −c · H(π), to the actor loss to keep exploring.
  5. Bootstrapping at terminal states: set V(s′) to zero when done=True.
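Mistakes 1, 4, and 5 from the list above can be illustrated together in one small sketch. The tiny linear networks, the random observations, and the coefficient `entropy_coef = 0.01` are assumptions made only for this example.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)
actor_net = nn.Linear(4, 2)     # hypothetical tiny actor producing action logits
critic_net = nn.Linear(4, 1)    # hypothetical tiny critic producing V(s)
gamma, entropy_coef = 0.99, 0.01  # illustrative values

obs, obs2 = torch.randn(4), torch.randn(4)
reward, done = 1.0, True        # a terminal transition

dist = Categorical(logits=actor_net(obs))
action = dist.sample()

v = critic_net(obs).squeeze(-1)
with torch.no_grad():
    v2 = critic_net(obs2).squeeze(-1)

# Mistake 5 avoided: (1 - done) removes V(s') at terminal states
delta = reward + gamma * (1.0 - float(done)) * v2 - v

# Mistake 1 avoided: delta is detached, so this loss cannot reach the critic.
# Mistake 4 avoided: subtracting c * H(pi) rewards higher-entropy policies.
actor_loss = -dist.log_prob(action) * delta.detach() - entropy_coef * dist.entropy()
actor_loss.backward()

# After the actor's backward pass the critic should have no gradients at all
critic_untouched = all(p.grad is None for p in critic_net.parameters())
print(critic_untouched)
```

Removing `.detach()` from the actor loss would make `critic_untouched` come out False, which is a quick way to test for mistake 1 in an existing training loop.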
