Module 8 — Continuous control & robotics

Soft actor–critic (SAC)

Maximum entropy RL, twin Q critics, automatic temperature tuning.

~70 min read + exercises

Before we begin

Soft Actor–Critic (SAC) is a top choice for continuous control research and robotics prototypes. It keeps an off-policy replay buffer like DDPG but uses a stochastic actor optimized to maximize expected Q plus entropy — encouraging exploration while staying sample-efficient. SAC often matches or beats TD3 with less manual noise tuning.

SAC — maximum-entropy RL; maximize E[Σ γᵗ (rₜ + α H(π(·|sₜ)))].
Entropy bonus — rewards policy randomness; α controls explore/exploit.
Reparameterization trick — backprop through stochastic actions.


What you will learn

  • State the maximum entropy objective and role of temperature α.
  • Walk through SAC's twin Q, stochastic actor, and automatic α tuning.
  • Implement action sampling with tanh squashing and log-prob correction.
  • Compare SAC vs DDPG/TD3 on stability and hyperparameters.
  • Run SAC on Pendulum-v1 (Module 8 project baseline).

Maximum entropy objective

Standard RL: maximize expected return. SAC adds entropy H(π(·|s)) at each step:

J(π) = E[ Σₜ γᵗ ( rₜ + α H(π(·|sₜ)) ) ]

High α → more randomness → more exploration. Low α → near-greedy. Automatic entropy tuning adjusts α so entropy tracks a target (often −dim(A)).

| α | Behavior |
| --- | --- |
| Large | Wide exploration, slower exploitation |
| Small | Near-deterministic, risk of getting stuck in a local optimum |
| Auto | Learn α with a loss on (entropy − target) |
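
One common form of the automatic temperature loss is L(log α) = −log α · mean(log π(a|s) + target entropy). The sketch below applies one gradient-descent step to it in plain Python, with the gradient derived by hand instead of using an autograd library. The function name `alpha_step`, the learning rate, and the sample log-probabilities are illustrative assumptions, not values from this lesson.

```python
import math

def alpha_step(log_alpha, log_probs, target_entropy, lr):
    """One gradient-descent step on
    L(log_alpha) = -log_alpha * mean(log_prob + target_entropy)."""
    gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -gap                      # dL/d(log_alpha), derived by hand
    return log_alpha - lr * grad

target_entropy = -1.0                # -dim(A) for a 1-D action space

# The entropy estimate is -mean(log_prob). Here it is -2.0, below the
# target, so the step should raise alpha.
too_peaked = alpha_step(0.0, [2.1, 1.9, 2.0], target_entropy, lr=0.1)

# Here the entropy estimate is 3.0, above the target, so alpha should fall.
too_random = alpha_step(0.0, [-3.0, -3.0], target_entropy, lr=0.1)

alpha_up, alpha_down = math.exp(too_peaked), math.exp(too_random)
```

Optimizing log α rather than α keeps the temperature positive without any clipping.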

SAC components

  1. Twin critics Q₁, Q₂ — min reduces overestimation (like TD3).
  2. Stochastic actor — Gaussian in pre-tanh space, squash to bounds.
  3. Target critics — soft Polyak updates.
  4. No separate target actor — sample from current actor for bootstrap.
  5. α — learnable log_α.
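
The soft Polyak update in item 3 can be sketched on plain lists of numbers. This is a minimal illustration: `polyak_update` is a hypothetical helper, and a real implementation would loop over network parameter tensors instead.

```python
def polyak_update(target_params, online_params, tau=0.005):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]

target = [0.0, 1.0]
online = [1.0, 1.0]

one_step = polyak_update(target, online)   # first entry moves 0.5% of the way

# Repeated updates make the target trail the online weights smoothly.
for _ in range(1000):
    target = polyak_update(target, online)
```

With τ = 0.005 the target closes only 0.5% of the gap per update, which keeps bootstrap targets slow-moving.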
python
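# Actor loss sketch (reparameterization)
import torch

# actor.sample draws a reparameterized sample, so gradients flow into the actor;
# log_prob already includes the tanh Jacobian correction.
a_pre, log_prob = actor.sample(s)
a = squash_to_env_bounds(a_pre)
q1, q2 = critic(s, a)
q_min = torch.min(q1, q2)  # pessimistic estimate from the twin critics
# Minimizing this maximizes Q plus entropy; alpha is held constant here.
actor_loss = (alpha * log_prob - q_min).mean()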
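
What a sampling routine with the tanh Jacobian has to compute can be sketched for a single 1-D action in plain Python. The names `softplus` and `sample_squashed` and the example mean and log-std are assumptions for illustration. The correction follows the change of variables log π(a) = log N(u; μ, σ) − log(1 − tanh(u)²), written in a numerically stable form.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sample_squashed(mu, log_std, rng):
    """Sample a = tanh(u) with u = mu + std * eps; return (a, log pi(a), u)."""
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)        # reparameterization noise
    u = mu + std * eps               # pre-tanh action
    # Log-density of u under N(mu, std^2); note (u - mu) / std == eps.
    log_prob_u = -0.5 * eps * eps - log_std - 0.5 * math.log(2.0 * math.pi)
    # log(1 - tanh(u)^2) rewritten as 2 * (log 2 - u - softplus(-2u)).
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return math.tanh(u), log_prob_u - log_det, u

rng = random.Random(0)
a, log_prob, u = sample_squashed(mu=0.3, log_std=-0.5, rng=rng)
```

If the squashed action is then rescaled affinely to the environment bounds, the log-probability shifts only by a constant, which does not change the actor gradient.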

The critic target uses a next action sampled from the current actor at s′ and adds the entropy term through the soft Bellman backup.


Soft Bellman backup (intuition)

Target for Q:

y = r + γ ( minᵢ Qᵢ′(s′, a′) − α log π(a′|s′) )

where a′ ~ π(·|s′). The −α log π term is the entropy bonus expressed in value space: when α is high, next states where the policy stays stochastic receive a higher target.
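
The target above can be computed for a single transition as follows. This is a sketch with made-up numbers. The `done` mask is an assumption added here because practical implementations usually stop bootstrapping at terminal states; it does not appear in the equation above.

```python
def soft_td_target(r, done, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return r + gamma * (1.0 - done) * soft_value

# Non-terminal transition: the smaller target-critic value (5.0) is used,
# and the entropy term adds -0.2 * (-0.5) = 0.1 to it.
y = soft_td_target(r=-1.0, done=0.0, q1_next=5.0, q2_next=5.4,
                   log_prob_next=-0.5, alpha=0.2)

# Terminal transition: no bootstrap, so the target is just the reward.
y_terminal = soft_td_target(r=-1.0, done=1.0, q1_next=5.0, q2_next=5.4,
                            log_prob_next=-0.5, alpha=0.2)
```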

Worked numeric intuition

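Suppose two candidate policies at a state have expected Q ≈ 5.0 and Q ≈ 5.1, but the second is nearly deterministic. With α > 0, SAC may still prefer the first: the objective is expected Q plus α times entropy, so a small Q gap can be outweighed by the entropy difference. This guards against premature collapse to a suboptimal deterministic policy.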
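
To put assumed numbers on this comparison, the sketch below scores two 1-D Gaussian policies by expected Q plus α times differential entropy. The standard deviations (1.0 and 0.05) and α = 0.2 are illustrative choices, not values from this lesson.

```python
import math

def gaussian_entropy(std):
    """Differential entropy (in nats) of a 1-D Gaussian with the given std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std * std)

def soft_objective(expected_q, std, alpha):
    """Per-state SAC objective: E[Q] + alpha * H(pi)."""
    return expected_q + alpha * gaussian_entropy(std)

alpha = 0.2
broad = soft_objective(5.0, std=1.0, alpha=alpha)     # lower Q, high entropy
narrow = soft_objective(5.1, std=0.05, alpha=alpha)   # higher Q, low entropy

# With alpha = 0 the entropy term vanishes and the higher-Q policy wins.
broad_greedy = soft_objective(5.0, std=1.0, alpha=0.0)
narrow_greedy = soft_objective(5.1, std=0.05, alpha=0.0)
```

Under these assumptions the broad policy scores higher when α = 0.2, while the narrow one wins at α = 0.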

Checkpoint: Why twin critics if SAC already has a stochastic actor?

Answer

Stochasticity does not fix Q overestimation from function approximation and bootstrapping. Twin critics + min, inherited from TD3, reduce optimistic targets that make the actor exploit critic errors.


Hyperparameters (practical)

| Param | Pendulum starting point | Notes |
| --- | --- | --- |
| lr | 3e-4 | Adam for all nets |
| γ | 0.99 | |
| τ | 0.005 | Target soft update |
| buffer | 100k | |
| batch | 256 | |
| warmup | 1000 random steps | Fill buffer |
| target_entropy | −dim(action) | For auto α |

Stable-Baselines3 SAC on Pendulum-v1 often solves the task in under 50k steps with its default settings.
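
The table's starting values can be mapped onto Stable-Baselines3's `SAC` constructor arguments roughly as below. This is a sketch that assumes `stable_baselines3` and a Gymnasium install providing Pendulum-v1; the `train` wrapper and the 50k step budget are illustrative. The import sits inside the function so the settings can be inspected without the library installed.

```python
# Keyword names follow the Stable-Baselines3 SAC constructor.
sac_kwargs = dict(
    learning_rate=3e-4,      # lr
    gamma=0.99,              # discount
    tau=0.005,               # target soft-update coefficient
    buffer_size=100_000,     # replay buffer
    batch_size=256,
    learning_starts=1_000,   # warmup steps before gradient updates
    ent_coef="auto",         # learn alpha automatically
    target_entropy=-1.0,     # -dim(action) for Pendulum's 1-D action
)

def train(total_timesteps=50_000):
    """Train SAC on Pendulum-v1 with the settings above (needs stable_baselines3)."""
    from stable_baselines3 import SAC
    model = SAC("MlpPolicy", "Pendulum-v1", **sac_kwargs)
    model.learn(total_timesteps=total_timesteps)
    return model
```

Calling `train()` runs the experiment; nothing is trained at import time.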


SAC vs DDPG / TD3 / PPO

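| | SAC | TD3 | PPO |
| --- | --- | --- | --- |
| Policy | Stochastic | Deterministic | Stochastic |
| Off-policy | Yes | Yes | No |
| Exploration | Entropy + α | Noise | On-policy sampling |
| Tuning | Moderate | Noise, τ | Clip ε, epochs |
| Sample efficiency | High | High | Lower |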

Use SAC when off-policy data is precious; PPO when simplicity and on-policy stability matter more than sample count.


Common mistakes

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Wrong log_prob (no tanh fix) | Biased actor | Use SB3 / cleanrl reference |
| target_entropy = 0 on Pendulum | Too greedy | Set −1 for 1D action |
| No warmup | Early garbage gradients | Random actions first |
| Huge α fixed | Never converges | Auto-tune α |
| Eval with stochastic policy | Noisy scores | Use mean or deterministic eval mode |
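
The last row can be illustrated with a toy tanh-Gaussian policy head: deterministic evaluation uses the squashed mean, while training samples around it. `policy_action` and the mean and log-std values are hypothetical; the bounds are Pendulum-v1's torque range of [−2, 2].

```python
import math
import random

LOW, HIGH = -2.0, 2.0   # Pendulum-v1 torque bounds

def policy_action(mu, log_std, rng, deterministic):
    """Return an action in [LOW, HIGH] from a tanh-squashed Gaussian head."""
    noise = 0.0 if deterministic else rng.gauss(0.0, 1.0)
    u = mu + math.exp(log_std) * noise
    # Map tanh(u) in [-1, 1] affinely onto [LOW, HIGH].
    return LOW + 0.5 * (math.tanh(u) + 1.0) * (HIGH - LOW)

rng = random.Random(0)
eval_actions = [policy_action(0.4, -1.0, rng, deterministic=True) for _ in range(5)]
train_actions = [policy_action(0.4, -1.0, rng, deterministic=False) for _ in range(5)]
```

In Stable-Baselines3 the equivalent switch is `model.predict(obs, deterministic=True)`.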

Closing

SAC combines entropy-regularized objectives with off-policy actor–critic engineering. It is the default for many continuous benchmarks and your Pendulum project. Next: bridging sim to real when policies trained in MuJoCo must run on hardware.

