Actor–critic architecture

Before we begin

An actor–critic method pairs a policy (the actor) π(a|s; θ) with a learned value function (the critic), V(s; w) or Q(s, a; w). The critic supplies a lower-variance learning signal than raw Monte Carlo returns; the actor uses that signal to improve the policy. This template underlies A2C, PPO, SAC, and many other widely used RL algorithms.

Learning objectives

Draw the actor–critic data flow: env → actor → action → critic evaluates.
Implement one-step actor–critic (TD advantage) on CartPole.
Distinguish V-critic vs Q-critic actor–critic variants.
Tune separate learning rates for actor and critic.
Recognize actor–critic as on-policy unless importance sampling is added.

Architecture diagram (conceptual)

text

        ┌──────────┐
  s ──► │  Actor   │ ──► a ~ π(·|s)
        └──────────┘
             │
             ▼
        environment ──► r, s'
             │
             ▼
        ┌──────────┐
  s ──► │  Critic  │ ──► V(s) or Q(s,a)
        └──────────┘

A shared trunk is optional: common layers (for example a CNN on pixel inputs) feed separate heads for policy and value. The layout resembles dueling DQN, but the heads serve a different purpose.
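As a rough illustration of the shared-trunk layout, the sketch below uses a small MLP trunk in place of CNN layers so it stays short. The class name SharedActorCritic, the layer sizes, and the CartPole-like shapes (4 observation features, 2 actions) are assumptions for the example, not part of the lesson's reference code.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class SharedActorCritic(nn.Module):
    """Hypothetical shared-trunk network: one feature extractor, two heads."""

    def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 64):
        super().__init__()
        # Shared trunk: both heads read the same features.
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)           # scalar V(s)

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        dist = Categorical(logits=self.policy_head(features))
        value = self.value_head(features).squeeze(-1)
        return dist, value


net = SharedActorCritic()
dist, value = net(torch.zeros(4))  # dummy observation
action = dist.sample()
```

Because both heads backpropagate into the same trunk, a single optimizer over net.parameters() is the usual choice here, typically with a weighting coefficient on the value loss.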

One-step TD advantage

text

δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
θ ← θ + α ∇ log π(a_t|s_t) · δ_t
w ← w + β δ_t ∇ V(s_t)

δ_t is a one-step advantage estimate: lower variance than the Monte Carlo return G_t, at the cost of some bias from the critic's approximation error.
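The three update rules above can be sketched in tabular form using only the standard library. This assumes a softmax policy over per-state action preferences theta[s][a] and a table V[s] for the critic; the function names and the step sizes alpha and beta are illustrative choices.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_update(theta, V, s, a, r, s_next, done,
                        gamma=0.99, alpha=0.1, beta=0.5):
    """Apply one tabular actor-critic update in place and return delta."""
    bootstrap = 0.0 if done else V[s_next]
    delta = r + gamma * bootstrap - V[s]  # one-step TD error

    # For a softmax policy: d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha * grad_log_pi * delta

    # For a tabular critic, the gradient of V(s) w.r.t. its own entry is 1.
    V[s] += beta * delta
    return delta


theta = [[0.0, 0.0], [0.0, 0.0]]  # two states, two actions
V = [0.0, 0.0]
delta = actor_critic_update(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

Starting from all zeros, this single transition gives δ = 1, raises the preference for the chosen action, lowers the other, and moves V(s) halfway toward the TD target.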

Full step — PyTorch

python

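import torch

gamma = 0.99

# Assumed to exist already: actor, critic, actor_opt, critic_opt, and the
# current observation obs. actor(obs_t) returns a torch.distributions object.
obs_t = torch.as_tensor(obs, dtype=torch.float32)

# actor: sample an action and keep its log-probability
dist = actor(obs_t)
action = dist.sample()
log_prob = dist.log_prob(action)

# Step the environment with this action here to obtain obs2, reward, done.
obs2_t = torch.as_tensor(obs2, dtype=torch.float32)

# critic: V(s) carries gradients; V(s') is a fixed bootstrap target
v = critic(obs_t).squeeze(-1)  # drop the trailing value dimension
with torch.no_grad():
    v2 = critic(obs2_t).squeeze(-1)

# done=True zeroes the bootstrap term at terminal states
td_target = torch.tensor(reward, dtype=torch.float32) + gamma * (1.0 - float(done)) * v2
delta = td_target - v

# detach delta so the actor loss does not backpropagate into the critic
actor_loss = -log_prob * delta.detach()
critic_loss = delta.pow(2)

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()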

Shared vs separate networks

Design	Pros	Cons
Separate actor/critic	Simple, stable on small envs	More parameters
Shared trunk + two heads	Sample efficient on pixels	Critic gradients may harm features
Q-critic (SAC, DDPG)	Works with continuous actions	More complex; off-policy

CartPole project: separate 2-layer MLPs are fine.
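One possible setup for the separate-network design is sketched below: two 2-layer MLPs, each with its own optimizer so the learning rates can differ. The hidden width and both learning rates are placeholder values, and wrapping the actor's logits in a Categorical distribution is an assumption about how the policy is represented.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, hidden = 4, 2, 64  # CartPole-sized; hidden width is assumed

# Actor: observation -> action logits
actor_net = nn.Sequential(
    nn.Linear(obs_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, n_actions),
)
# Critic: observation -> scalar V(s)
critic_net = nn.Sequential(
    nn.Linear(obs_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, 1),
)

# Separate optimizers let each network have its own learning rate.
actor_opt = torch.optim.Adam(actor_net.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic_net.parameters(), lr=5e-3)

obs = torch.zeros(obs_dim)  # dummy observation
dist = torch.distributions.Categorical(logits=actor_net(obs))
value = critic_net(obs).squeeze(-1)
```

The two networks share no parameters, so a bad critic update cannot disturb the policy's features, at the cost of learning the observation encoding twice.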

Worked example — TD delta

γ=0.99, r=+1, V(s)=10, V(s′)=12, done=False.

text

δ = 1 + 0.99 × 12 − 10 = 2.88

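Positive δ: the action turned out better than the critic expected, so the update increases log π(a|s). If V(s′) were 8 instead: δ = 1 + 0.99 × 8 − 10 = −1.08, a negative advantage that lowers the action's probability.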
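The same arithmetic can be checked with a few lines of Python. The helper name td_delta is made up for this sketch; it reproduces the V(s′)=12 and V(s′)=8 cases and adds a terminal case where the bootstrap term is dropped.

```python
def td_delta(r, v_s, v_next, gamma=0.99, done=False):
    """One-step TD error; the bootstrap term is dropped at terminal states."""
    bootstrap = 0.0 if done else gamma * v_next
    return r + bootstrap - v_s


print(round(td_delta(1.0, 10.0, 12.0), 2))             # 2.88
print(round(td_delta(1.0, 10.0, 8.0), 2))              # -1.08
print(round(td_delta(1.0, 10.0, 12.0, done=True), 2))  # -9.0
```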

On-policy constraint

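Standard actor–critic learns from data collected by the current policy π. When π changes mid-batch, advantages computed under the old policy go stale. PPO (Module 6) limits the damage with a clipped objective that keeps each update close to the data-collecting policy. Combining DQN-style replay with policy gradients requires importance sampling corrections (advanced).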
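To make the correction concrete, here is a minimal sketch of an importance-weighted policy-gradient term, with an optional PPO-style clip. The function name and the example numbers are hypothetical, and this is not the Module 6 implementation; it only shows how the ratio π_new(a|s) / π_old(a|s) enters the objective.

```python
import math


def importance_weighted_term(logp_new, logp_old, advantage, clip_eps=None):
    """Per-sample surrogate: rho * A, optionally with a PPO-style clip."""
    rho = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = rho * advantage
    if clip_eps is None:
        return unclipped
    clipped_rho = max(1.0 - clip_eps, min(1.0 + clip_eps, rho))
    # Taking the minimum removes any gain from pushing rho outside the clip range.
    return min(unclipped, clipped_rho * advantage)


logp_old, logp_new = math.log(0.5), math.log(0.8)  # the action became more likely
plain = importance_weighted_term(logp_new, logp_old, advantage=2.0)
clipped = importance_weighted_term(logp_new, logp_old, advantage=2.0, clip_eps=0.2)
```

Here the ratio is 1.6, so the unclipped term is 3.2 while the clipped version caps it at 1.2 × 2.0 = 2.4.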

Checkpoint

Actor–critic = REINFORCE + a learned baseline that updates every step via TD.

Summary: two networks, one job each. The actor chooses; the critic judges how surprising the outcome was.

Common mistakes

Actor loss backpropagating into the critic: call .detach() on δ in the actor loss.
Same LR for actor and critic: the critic often benefits from a higher learning rate or more updates per step.
Updating the critic only at episode end: this wastes the TD structure; update at every step.
Entropy collapse: add an entropy bonus, a term −c · H(π) in the actor loss, to keep the policy exploring.
Bootstrapping at terminal states: zero out the V(s′) term when done=True.
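The entropy bonus from the list above can be sketched for a discrete policy using only the standard library. The function names and the coefficient c=0.01 are assumptions for illustration; a sensible value of c depends on the task.

```python
import math


def entropy(probs):
    """Shannon entropy H(pi) of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_loss(log_prob, delta, probs, c=0.01):
    """Policy-gradient loss with an entropy bonus: -log pi * delta - c * H(pi)."""
    return -log_prob * delta - c * entropy(probs)


uniform = [0.5, 0.5]
peaked = [0.99, 0.01]
loss_uniform = actor_loss(math.log(0.5), 1.0, uniform)
loss_peaked = actor_loss(math.log(0.5), 1.0, peaked)
```

With the same log-probability and δ, the more uniform policy gets the lower loss, so gradient descent is nudged away from collapsing onto a single action. In PyTorch, torch.distributions.Categorical provides an entropy() method that can play the role of the entropy function here.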
