Actor–critic architecture
Before we begin
An actor–critic method pairs a policy (the actor), π(a|s; θ), with a learned value function (the critic), V(s; w) or Q(s,a; w). The critic supplies a lower-variance learning signal than Monte Carlo returns, and the actor uses that signal to improve the policy. This template underlies A2C, PPO, SAC, and many other widely used deep RL algorithms.
Learning objectives
- Draw the actor–critic data flow: env → actor → action → critic evaluates.
- Implement one-step actor–critic (TD advantage) on CartPole.
- Distinguish V-critic vs Q-critic actor–critic variants.
- Tune separate learning rates for actor and critic.
- Recognize that actor–critic is on-policy unless importance sampling is added.
Architecture diagram (conceptual)
      ┌──────────┐
s ──► │  Actor   │ ──► a ~ π(·|s)
      └──────────┘
           │
           ▼
      environment ──► r, s'
           │
           ▼
      ┌──────────┐
s ──► │  Critic  │ ──► V(s) or Q(s,a)
      └──────────┘
Shared trunk optional: common CNN layers, separate heads for policy and value (like dueling DQN, different purpose).
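The shared-trunk option can be sketched as a single module with two heads. This is a minimal illustration, not the lesson's reference implementation: the class name `SharedActorCritic`, the hidden width of 64, the Tanh activation, and the CartPole-sized dimensions are all assumptions, and an MLP trunk stands in for the CNN layers you would use on pixels.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class SharedActorCritic(nn.Module):
    """Hypothetical shared-trunk network: one feature extractor, two heads."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Trunk shared by both heads (an MLP here; a CNN for pixel inputs).
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)           # scalar V(s)

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        dist = Categorical(logits=self.policy_head(features))
        value = self.value_head(features).squeeze(-1)
        return dist, value


net = SharedActorCritic(obs_dim=4, n_actions=2)  # CartPole-sized (assumed)
dist, value = net(torch.zeros(4))
action = dist.sample()
```

Because both heads backpropagate into `trunk`, the actor and critic losses are usually summed and optimized together in this layout, which is where the critic's gradients can interfere with the policy's features.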
One-step TD advantage
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
θ ← θ + α ∇ log π(a_t|s_t) · δ_t
w ← w + β δ_t ∇ V(s_t)
δ_t is a one-step advantage estimate: lower variance than the Monte Carlo return G_t, at the cost of some bias from the learned critic.
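To see why δ_t counts as an advantage estimate, suppose (as an idealization) that the critic equals the true value function V^π of the current policy. Taking the expectation over the next reward and state then gives:

```latex
\begin{aligned}
\mathbb{E}\left[\delta_t \mid s_t, a_t\right]
  &= \mathbb{E}\left[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t, a_t\right] - V^{\pi}(s_t) \\
  &= Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \\
  &= A^{\pi}(s_t, a_t)
\end{aligned}
```

With a learned, imperfect V(s; w) the equality holds only approximately, which is where the bias comes from.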
Full step — PyTorch
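import torch

gamma = 0.99

# Assumed to exist already: actor (returns a torch.distributions object),
# critic (returns V(s)), actor_opt, critic_opt, and the current observation obs.
obs_t = torch.as_tensor(obs, dtype=torch.float32)

# actor: sample an action and keep its log-probability
dist = actor(obs_t)
action = dist.sample()
log_prob = dist.log_prob(action)

# Step the environment with `action` here to obtain reward, obs2, done.
obs2_t = torch.as_tensor(obs2, dtype=torch.float32)

# critic: V(s) with gradients, V(s') without
v = critic(obs_t).squeeze(-1)
with torch.no_grad():
    v2 = critic(obs2_t).squeeze(-1)
    # (1 - done) zeroes the bootstrap term at terminal states
    td_target = reward + gamma * (1.0 - float(done)) * v2
delta = td_target - v

actor_loss = -log_prob * delta.detach()  # δ is a constant for the actor
critic_loss = delta.pow(2)               # squared TD error

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
Shared vs separate networks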
| Design | Pros | Cons |
|---|---|---|
| Separate actor/critic | Simple, stable on small envs | More parameters |
| Shared trunk + two heads | Sample efficient on pixels | Critic gradients may harm features |
| Q-critic (SAC, DDPG) | Works continuous actions | More complex off-policy |
CartPole project: separate 2-layer MLPs are fine.
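A minimal sketch of that separate-network setup, assuming CartPole's 4-dimensional observation and 2 discrete actions. The helper name `make_mlp`, the hidden width of 64, and the Tanh activation are illustrative choices, not something the lesson prescribes.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


def make_mlp(in_dim: int, out_dim: int, hidden: int = 64) -> nn.Sequential:
    """Two linear layers with a Tanh in between (sizes are illustrative)."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, out_dim))


obs_dim, n_actions = 4, 2                  # CartPole-sized (assumed)
actor_net = make_mlp(obs_dim, n_actions)   # outputs action logits
critic_net = make_mlp(obs_dim, 1)          # outputs V(s)

obs = torch.zeros(obs_dim)                 # placeholder observation
dist = Categorical(logits=actor_net(obs))  # pi(.|s)
value = critic_net(obs).squeeze(-1)        # scalar V(s)

# The two networks share no parameters; counting them shows the
# "more parameters" cost listed in the table.
n_actor = sum(p.numel() for p in actor_net.parameters())
n_critic = sum(p.numel() for p in critic_net.parameters())
print(n_actor, n_critic)
```

At this size the extra parameters are negligible, which is one reason separate networks are a reasonable default for small environments.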
Worked example — TD delta
γ=0.99, r=+1, V(s)=10, V(s′)=12, done=False.
δ = 1 + 0.99 × 12 − 10 = 2.88
Positive δ: the action turned out better than the critic expected, so the update increases log π(a|s). If V(s′) were 8 instead: δ = 1 + 0.99 × 8 − 10 = −1.08, a penalty that lowers log π(a|s).
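The arithmetic above can be checked with a few lines of plain Python. The function name `td_delta` is a hypothetical helper introduced here for illustration; the third call adds a terminal case that the worked example does not cover.

```python
def td_delta(reward, v_s, v_next, gamma=0.99, done=False):
    """One-step TD error: r + gamma * V(s') - V(s), with no bootstrap at terminal states."""
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s


print(round(td_delta(1.0, 10.0, 12.0), 2))             # 2.88
print(round(td_delta(1.0, 10.0, 8.0), 2))              # -1.08
print(round(td_delta(1.0, 10.0, 12.0, done=True), 2))  # -9.0
```

The terminal case shows how much the bootstrap term matters: with the same reward and V(s), dropping γ V(s′) turns a positive surprise into a large negative one.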
On-policy constraint
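Standard actor–critic uses data collected by the current policy π. When π changes mid-batch, advantages computed under the old policy become stale. PPO (Module 6) mitigates this with a clipped objective that limits how far the new policy moves from the one that collected the data. Combining DQN-style replay with policy gradients requires importance sampling corrections (advanced).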
Checkpoint: Actor–critic = REINFORCE + a learned baseline that updates every step via TD.
Summary: Two networks, one job each. The actor chooses, the critic judges surprise.
Common mistakes
- Letting the actor loss backpropagate into the critic: call .detach() on δ in the actor loss.
- Same LR for actor and critic: the critic often needs a higher LR or more updates.
- Updating the critic only at episode end: this wastes the TD structure; update each step.
- Entropy collapse: add an entropy bonus −c · H(π) to the actor loss for exploration.
- Bootstrapping at terminal states: zero V(s′) when done=True.
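Several of these fixes can be sketched together in one update step. This is an illustration under stated assumptions: single linear layers stand in for the real networks, the transition is made up, and the learning rates and `entropy_coef` (the c in the entropy term) are placeholder values rather than tuned recommendations.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)
actor_net = nn.Linear(4, 2)    # stand-in actor: logits for 2 actions
critic_net = nn.Linear(4, 1)   # stand-in critic: V(s)

# Separate optimizers so the critic can use a larger step size (illustrative values).
actor_opt = torch.optim.Adam(actor_net.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic_net.parameters(), lr=1e-3)

gamma, entropy_coef = 0.99, 0.01

# A made-up terminal transition (s, r, s', done).
obs, obs2 = torch.randn(4), torch.randn(4)
reward, done = 1.0, True

dist = Categorical(logits=actor_net(obs))
action = dist.sample()

v = critic_net(obs).squeeze(-1)
with torch.no_grad():
    v2 = critic_net(obs2).squeeze(-1)
    # The (1 - done) mask removes the bootstrap term, so the target is just the reward here.
    td_target = reward + gamma * (1.0 - float(done)) * v2
delta = td_target - v

# detach() keeps critic parameters out of the actor's gradient;
# subtracting c * H(pi) from the loss favors higher-entropy policies.
actor_loss = -dist.log_prob(action) * delta.detach() - entropy_coef * dist.entropy()
critic_loss = delta.pow(2)

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
```

Giving the critic more updates per actor update is an alternative to raising its learning rate; which works better depends on the environment.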