Soft actor–critic (SAC)
Before we begin
Soft Actor–Critic (SAC) is a widely used algorithm for continuous-control research and robotics prototypes. Like DDPG, it learns off-policy from a replay buffer, but its actor is stochastic and is trained to maximize expected Q-value plus policy entropy, which encourages exploration while keeping sample efficiency high. SAC often matches or beats TD3 with less manual noise tuning.
SAC — maximum-entropy RL; maximize E[Σ γᵗ (rₜ + α H(π(·|sₜ)))].
Entropy bonus — rewards policy randomness; α controls explore/exploit.
Reparameterization trick — backprop through stochastic actions.
What you will learn
- State the maximum entropy objective and role of temperature α.
- Walk through SAC's twin Q, stochastic actor, and automatic α tuning.
- Implement action sampling with tanh squashing and log-prob correction.
- Compare SAC vs DDPG/TD3 on stability and hyperparameters.
- Run SAC on Pendulum-v1 (Module 8 project baseline).
Maximum entropy objective
Standard RL: maximize expected return. SAC adds entropy H(π(·|s)) at each step:
J(π) = E[ Σₜ γᵗ ( rₜ + α H(π(·|sₜ)) ) ]
High α → more randomness → more exploration. Low α → near-greedy. Automatic entropy tuning adjusts α so entropy tracks a target (often −dim(A)).
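To see how the entropy term changes the objective, the sum above can be evaluated on a toy trajectory. This is only an illustrative sketch: `soft_return` and the reward and entropy values are hypothetical, chosen to keep the arithmetic easy to follow.

```python
def soft_return(rewards, entropies, alpha, gamma):
    """Discounted sum of (reward + alpha * entropy) over one trajectory."""
    total = 0.0
    for t, (r, h) in enumerate(zip(rewards, entropies)):
        total += (gamma ** t) * (r + alpha * h)
    return total


# Hypothetical two-step trajectory; entropies are in nats.
rewards = [1.0, 1.0]
entropies = [0.5, 0.5]

plain = soft_return(rewards, entropies, alpha=0.0, gamma=0.5)  # standard return
soft = soft_return(rewards, entropies, alpha=0.2, gamma=0.5)   # entropy-augmented
print(plain, soft)
```

Setting `alpha=0.0` recovers the standard discounted return, so the gap between the two numbers is exactly the discounted entropy bonus.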
| α | Behavior |
|---|---|
| Large | Wide exploration, slower exploitation |
| Small | Near deterministic, risk of getting stuck in a local optimum |
| Auto | Learn α with loss on entropy − target |
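The "Auto" row can be made concrete with a single hand-written gradient step. The sketch below assumes one common form of the temperature loss, L(log_α) = −log_α · mean(log π + target_entropy); some implementations multiply by α instead of log_α. The function name `alpha_step` and the batch values are hypothetical.

```python
import math


def alpha_step(log_alpha, log_probs, target_entropy, lr):
    """One gradient-descent step on
    L(log_alpha) = -log_alpha * mean(log_prob + target_entropy)."""
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -mean_gap  # dL / d(log_alpha)
    return log_alpha - lr * grad


target_entropy = -1.0  # -dim(A) for a 1-D action space
log_alpha = 0.0        # alpha = exp(0) = 1

# Entropy estimate is -mean(log_prob) = -2, below the target: alpha should rise.
too_peaked = alpha_step(log_alpha, [2.0, 2.0], target_entropy, lr=0.1)

# Entropy estimate is +1, above the target: alpha should fall.
too_random = alpha_step(log_alpha, [-1.0, -1.0], target_entropy, lr=0.1)

print(math.exp(too_peaked), math.exp(too_random))
```

Optimizing log_α rather than α keeps the temperature positive without an explicit constraint.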
SAC components
- Twin critics Q₁, Q₂ — min reduces overestimation (like TD3).
- Stochastic actor — Gaussian in pre-tanh space, squash to bounds.
- Target critics — soft Polyak updates.
- No separate target actor — sample from current actor for bootstrap.
- α — learnable log_α.
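The stochastic actor item above can be sketched without any deep learning framework. The code below is a minimal illustration, assuming a diagonal Gaussian in pre-tanh space with actions squashed to (−1, 1); rescaling to the environment's bounds is left out. All function names are hypothetical, and a real actor would produce `mean` and `log_std` from a network.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tanh_log_det(u):
    """log(1 - tanh(u)^2), in a form that stays finite for large |u|."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def squashed_log_prob(u, mean, log_std):
    """Log-density of a = tanh(u), with u ~ N(mean, exp(log_std)^2) per dimension."""
    total = 0.0
    for u_i, m_i, ls_i in zip(u, mean, log_std):
        z = (u_i - m_i) / math.exp(ls_i)
        gaussian = -0.5 * z * z - ls_i - 0.5 * LOG_2PI
        total += gaussian - tanh_log_det(u_i)  # change-of-variables correction
    return total


def sample_action(mean, log_std, rng):
    """Reparameterized sample: u = mean + std * eps, then squash with tanh."""
    u = [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]
    action = [math.tanh(u_i) for u_i in u]
    return action, squashed_log_prob(u, mean, log_std)


rng = random.Random(0)
action, log_prob = sample_action(mean=[0.3], log_std=[-0.5], rng=rng)
print(action, log_prob)
```

Writing the sample as `mean + std * eps` is what lets gradients flow through the action in an autodiff framework; here it only shows the structure.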
# Actor loss sketch (reparameterization)
a_pre, log_prob = actor.sample(s) # includes tanh Jacobian
a = squash_to_env_bounds(a_pre)
q1, q2 = critic(s, a)
q_min = torch.min(q1, q2)
actor_loss = (alpha * log_prob - q_min).mean()
The critic target uses a next action sampled from the actor at s′, plus the entropy term from the soft Bellman backup.
Soft Bellman backup (intuition)
Target for Q:
y = r + γ ( minᵢ Qᵢ′(s′, a′) − α log π(a′|s′) )
where a′ ~ π(·|s′). The −α log π term is the entropy bonus in value space: when α is high, next states where the policy stays stochastic contribute a larger target.
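As a scalar sketch, the target above takes only a few lines. The `done` mask is an assumption added here (it zeroes the bootstrap at terminal transitions and is not part of the equation as written), and `soft_td_target` is a hypothetical helper name.

```python
def soft_td_target(reward, done, q1_next, q2_next, log_prob_next, alpha, gamma):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')), masked at terminals."""
    bootstrap = min(q1_next, q2_next) - alpha * log_prob_next
    not_done = 0.0 if done else 1.0  # assumption: no bootstrap after a terminal state
    return reward + gamma * not_done * bootstrap


# Hypothetical numbers: the lower critic value (4.0) is used, and a negative
# log-probability (a spread-out policy) raises the target.
y = soft_td_target(
    reward=1.0, done=False,
    q1_next=5.0, q2_next=4.0,
    log_prob_next=-1.0, alpha=0.2, gamma=0.99,
)
y_terminal = soft_td_target(
    reward=1.0, done=True,
    q1_next=5.0, q2_next=4.0,
    log_prob_next=-1.0, alpha=0.2, gamma=0.99,
)
print(y, y_terminal)
```

In a batched implementation, the values passed in for the next state would come from the target critics and be treated as constants when computing the critic loss.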
Worked numeric intuition
Suppose two candidate policies at a state have expected Q ≈ 5 and Q ≈ 5.1, but the second is nearly deterministic while the first keeps substantial entropy. With α > 0, SAC may still prefer the first: the entropy bonus α·H can outweigh the 0.1 gap in Q. This guards against premature collapse to a suboptimal deterministic policy.
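The trade-off can be checked by scoring each candidate as expected Q plus α times entropy. The entropy values below are hypothetical (differential entropy in nats, which can be negative for a very peaked continuous policy), and `soft_score` is an illustrative helper.

```python
def soft_score(expected_q, entropy, alpha):
    """Entropy-regularized value of a policy at one state: E[Q] + alpha * H."""
    return expected_q + alpha * entropy


broad = {"expected_q": 5.0, "entropy": 0.5}    # stochastic policy
peaked = {"expected_q": 5.1, "entropy": -2.0}  # nearly deterministic policy

for alpha in (0.2, 0.01):
    s_broad = soft_score(alpha=alpha, **broad)
    s_peaked = soft_score(alpha=alpha, **peaked)
    winner = "broad" if s_broad > s_peaked else "peaked"
    print(f"alpha={alpha}: broad={s_broad:.3f}, peaked={s_peaked:.3f} -> {winner}")
```

With these numbers the broad policy wins at α = 0.2 and the peaked one wins at α = 0.01; the break-even point is α = 0.04, where the 2.5-nat entropy gap exactly offsets the 0.1 gap in Q.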
Checkpoint: Why twin critics if SAC already has a stochastic actor?
Answer
Stochasticity does not fix Q overestimation from function approximation and bootstrapping. Twin critics + min, inherited from TD3, reduce optimistic targets that make the actor exploit critic errors.
Hyperparameters (practical)
| Param | Pendulum starting point | Notes |
|---|---|---|
| lr | 3e-4 | Adam for all nets |
| γ | 0.99 | |
| τ | 0.005 | Target soft update |
| buffer | 100k | |
| batch | 256 | |
| warmup | 1000 random steps | Fill buffer |
| target_entropy | −dim(action) | For auto α |
Stable-Baselines3's SAC with default hyperparameters often learns a good Pendulum-v1 policy in under 50k environment steps.
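The τ row in the table refers to the soft (Polyak) update of the target critics. A minimal sketch on plain Python lists, standing in for network parameters, shows how slowly the targets move at τ = 0.005; `polyak_update` is a hypothetical helper, not a library function.

```python
def polyak_update(target_params, online_params, tau):
    """Return (1 - tau) * target + tau * online, element-wise."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = [0.0, 0.0]
online = [1.0, -2.0]  # held fixed here; in training these change every step

one_step = polyak_update(target, online, tau=0.005)

three_steps = target
for _ in range(3):
    three_steps = polyak_update(three_steps, online, tau=0.005)

print(one_step, three_steps)
```

With the online values held fixed, the remaining gap shrinks by a factor of (1 − τ) per update, so at τ = 0.005 it takes on the order of 1/τ = 200 updates to close most of it.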
SAC vs DDPG / TD3 / PPO
| | SAC | TD3 | PPO |
|---|---|---|---|
| Policy | Stochastic | Deterministic | Stochastic |
| Off-policy | Yes | Yes | No |
| Exploration | Entropy + α | Noise | On-policy sampling |
| Tuning | Moderate | Noise, τ | Clip ε, epochs |
| Sample efficiency | High | High | Lower |
Use SAC when environment interactions are expensive and you want to reuse them off-policy; use PPO when simplicity and on-policy stability matter more than sample count.
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Wrong log_prob (no tanh correction) | Biased actor | Subtract log(1 − tanh²(u)) per action dimension; check against the SB3 / CleanRL reference |
| target_entropy = 0 on Pendulum | Policy stays noisier than needed | Set −1 for 1D action |
| No warmup | Early garbage gradients | Random actions first |
| Huge α fixed | Never converges | Auto-tune α |
| Eval with stochastic policy | Noisy scores | Use mean or deterministic eval mode |
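The last row of the table can be illustrated with a small action-selection helper. This is a sketch under the same assumptions as a tanh-squashed Gaussian actor with actions in (−1, 1); `select_action` and the actor outputs are hypothetical.

```python
import math
import random


def select_action(mean, log_std, deterministic, rng=None):
    """Evaluation: tanh(mean). Training: tanh of a Gaussian sample."""
    if deterministic:
        return [math.tanh(m) for m in mean]
    rng = rng or random.Random()
    return [
        math.tanh(m + math.exp(ls) * rng.gauss(0.0, 1.0))
        for m, ls in zip(mean, log_std)
    ]


mean, log_std = [0.3], [-0.5]  # hypothetical actor outputs for one state

eval_action = select_action(mean, log_std, deterministic=True)
train_action = select_action(mean, log_std, deterministic=False, rng=random.Random(0))
print(eval_action, train_action)
```

The deterministic branch returns the same action every time for a given state, which removes sampling noise from evaluation scores.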
Closing
SAC combines entropy-regularized objectives with off-policy actor–critic engineering. It is the default for many continuous benchmarks and your Pendulum project. Next: bridging sim to real when policies trained in MuJoCo must run on hardware.