Soft actor–critic (SAC)

Before we begin

Soft Actor–Critic (SAC) is a top choice for continuous control research and robotics prototypes. It keeps an off-policy replay buffer like DDPG but uses a stochastic actor optimized to maximize expected Q plus entropy — encouraging exploration while staying sample-efficient. SAC often matches or beats TD3 with less manual noise tuning.

SAC — maximum-entropy RL; maximize E[Σ γᵗ (rₜ + α H(π(·|sₜ)))].
Entropy bonus — rewards policy randomness; α controls explore/exploit.
Reparameterization trick — backprop through stochastic actions.

What you will learn

State the maximum entropy objective and role of temperature α.
Walk through SAC's twin Q, stochastic actor, and automatic α tuning.
Implement action sampling with tanh squashing and log-prob correction.
Compare SAC vs DDPG/TD3 on stability and hyperparameters.
Run SAC on Pendulum-v1 (Module 8 project baseline).

Maximum entropy objective

Standard RL: maximize expected return. SAC adds entropy H(π(·|s)) at each step:

J(π) = E[ Σₜ γᵗ ( rₜ + α H(π(·|sₜ)) ) ]

High α → more randomness → more exploration. Low α → near-greedy. Automatic entropy tuning adjusts α so entropy tracks a target (often −dim(A)).

α	Behavior
Large	Wide exploration, slower exploitation
Small	Near-deterministic, risk of getting stuck in a local optimum
Auto	Learn α with loss on entropy − target
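The automatic tuning in the last row can be sketched as a one-parameter optimization. What follows is a minimal PyTorch sketch, not a reference implementation: `log_alpha`, `alpha_loss`, and the stand-in `log_prob` batch are illustrative names, and the loss uses the common form that raises α when the policy's entropy estimate falls below the target.

```python
import torch

action_dim = 1                       # Pendulum-v1 has a 1-D action
target_entropy = -float(action_dim)  # common heuristic: -dim(A)

# Optimize log(alpha) so that alpha = exp(log_alpha) stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)


def alpha_loss(log_alpha: torch.Tensor, log_prob: torch.Tensor) -> torch.Tensor:
    """Temperature loss; log_prob holds log pi(a|s) for sampled actions."""
    # Detach so this loss only updates the temperature, not the actor.
    entropy_gap = (log_prob + target_entropy).detach()
    return -(log_alpha * entropy_gap).mean()


# Stand-in batch: log_prob = 2.0 means an entropy estimate of -2.0,
# which is below the -1.0 target, so alpha should be pushed up.
log_prob = torch.full((256, 1), 2.0)

loss = alpha_loss(log_alpha, log_prob)
alpha_optimizer.zero_grad()
loss.backward()
alpha_optimizer.step()

alpha = log_alpha.exp().item()  # slightly above its initial value of 1.0
```

If the entropy estimate were above the target instead, the sign of the gradient would flip and α would shrink, letting the policy become greedier.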

SAC components

Twin critics Q₁, Q₂ — min reduces overestimation (like TD3).
Stochastic actor — Gaussian in pre-tanh space, squash to bounds.
Target critics — soft Polyak updates.
No separate target actor — sample from current actor for bootstrap.
α — learnable log_α.
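To get a feel for how slowly the target critics move under soft Polyak updates, here is a small plain-Python sketch. `polyak_update` is a hypothetical helper that works on lists of floats standing in for network weights, and τ = 0.005 is the value from the hyperparameter table in this lesson.

```python
def polyak_update(target, online, tau):
    """Return target weights moved a fraction tau toward the online weights."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


tau = 0.005
online = [1.0, -2.0]  # stand-in for the current critic's weights (held fixed)
target = [0.0, 0.0]   # stand-in for the target critic's weights

steps = 0
# Count updates until the target has closed half of the initial gap.
while abs(online[0] - target[0]) > 0.5 * abs(online[0]):
    target = polyak_update(target, online, tau)
    steps += 1

print(steps)  # 139
```

With τ = 0.005 the target needs on the order of 140 updates to close half the gap, which keeps bootstrap targets from chasing every critic update.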

python

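import torch

# Actor loss sketch (reparameterization); actor, critic, alpha, and s come from the training loop
a_pre, log_prob = actor.sample(s)  # pre-squash action; log_prob includes the tanh Jacobian term
a = squash_to_env_bounds(a_pre)    # tanh, then rescale to the environment's action bounds
q1, q2 = critic(s, a)
q_min = torch.min(q1, q2)          # pessimistic estimate from the twin critics
# Minimizing alpha * log_prob - Q maximizes Q plus alpha-weighted entropy.
# Treat alpha as a constant here (e.g. detached) so this loss does not update it.
actor_loss = (alpha * log_prob - q_min).mean()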
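
One way `actor.sample` could produce a squashed action and its corrected log-probability is sketched below. This is an assumption about the actor's internals, not the lesson's reference code: `sample_squashed_gaussian`, `mean`, and `log_std` are illustrative names, and the action is left in [−1, 1] rather than rescaled to the environment's bounds.

```python
import math

import torch
import torch.nn.functional as F
from torch.distributions import Normal


def sample_squashed_gaussian(mean: torch.Tensor, log_std: torch.Tensor):
    """Sample a = tanh(u) with u ~ N(mean, std); return (a, log pi(a|s))."""
    dist = Normal(mean, log_std.exp())
    u = dist.rsample()  # reparameterized: gradients flow into mean and log_std
    a = torch.tanh(u)
    # Change of variables: log pi(a) = log N(u) - sum_i log(1 - tanh(u_i)^2).
    # Stable identity: log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)).
    log_det = 2.0 * (math.log(2.0) - u - F.softplus(-2.0 * u))
    log_prob = (dist.log_prob(u) - log_det).sum(dim=-1, keepdim=True)
    return a, log_prob


# Stand-in actor outputs for a batch of 4 states with a 1-D action.
mean = torch.zeros(4, 1, requires_grad=True)
log_std = torch.full((4, 1), -0.5, requires_grad=True)

action, log_prob = sample_squashed_gaussian(mean, log_std)
log_prob.sum().backward()  # works because rsample() keeps the graph
```

Dropping the `log_det` term is the "no tanh fix" error: the log-probability then describes the unsquashed Gaussian rather than the action the environment actually receives.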

The critic target uses a next action sampled from the current actor at s′ and includes the entropy term from the soft Bellman backup.

Soft Bellman backup (intuition)

Target for Q:

y = r + γ ( minᵢ Qᵢ′(s′, a′) − α log π(a′|s′) )

where a′ ~ π(·|s′). The −α log π term carries the entropy bonus into the value estimate: when α is high, next states where the policy stays stochastic receive higher targets.
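
A worked instance of the backup with made-up numbers may help. `soft_td_target` is a hypothetical helper, and the `done` flag, which zeroes the bootstrap term at terminal transitions, is a common implementation detail that the formula above leaves out.

```python
def soft_td_target(r, done, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return r + gamma * (1.0 - done) * soft_value


# Illustrative values only: reward -1.0, target critics disagree slightly,
# and the sampled next action has log-probability -0.5.
y = soft_td_target(r=-1.0, done=0.0, q1_next=5.0, q2_next=5.2,
                   log_prob_next=-0.5, alpha=0.2)
print(round(y, 4))  # 4.049

# At a terminal transition the bootstrap term vanishes and y equals r.
y_terminal = soft_td_target(r=-1.0, done=1.0, q1_next=5.0, q2_next=5.2,
                            log_prob_next=-0.5, alpha=0.2)
print(y_terminal)  # -1.0
```

Here the minimum picks 5.0 over 5.2, and the negative log-probability adds 0.2 × 0.5 = 0.1 to the soft value before discounting.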

Worked numeric intuition

If two candidate policies reach Q ≈ 5 and Q ≈ 5.1 but the second is nearly deterministic, SAC with α > 0 may still prefer the first: a similar Q with higher entropy wins. This guards against premature collapse to a suboptimal deterministic policy.
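
To put numbers on this, the sketch below scores two one-dimensional Gaussian policies by Q + αH, the per-state quantity the soft objective trades off. The standard deviations (0.5 and 0.01) and α = 0.2 are assumed values chosen for illustration, and the entropy is that of an unsquashed Gaussian, which simplifies SAC's tanh-squashed policy. Only the Q values come from the text.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def soft_score(q, std, alpha):
    """Expected Q plus the entropy bonus alpha * H."""
    return q + alpha * gaussian_entropy(std)


alpha = 0.2
broad = soft_score(q=5.0, std=0.5, alpha=alpha)    # lower Q, high entropy
narrow = soft_score(q=5.1, std=0.01, alpha=alpha)  # higher Q, near-deterministic

print(broad > narrow)  # True: the entropy bonus outweighs the 0.1 gap in Q
print(soft_score(5.0, 0.5, 0.0) > soft_score(5.1, 0.01, 0.0))  # False when alpha = 0
```

The ranking flips as α goes to zero, which is the sense in which α sets the explore/exploit balance.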

Checkpoint: Why twin critics if SAC already has a stochastic actor?

Answer

Stochasticity does not fix Q overestimation from function approximation and bootstrapping. Twin critics + min, inherited from TD3, reduce optimistic targets that make the actor exploit critic errors.

Hyperparameters (practical)

Param	Pendulum starting point	Notes
lr	3e-4	Adam for all nets
γ	0.99
τ	0.005	Target soft update
buffer	100k
batch	256
warmup	1000 random steps	Fill buffer
target_entropy	−dim(action)	For auto α

Stable-Baselines3 SAC on Pendulum-v1 often solves in < 50k steps with defaults.
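
The table maps onto Stable-Baselines3's `SAC` constructor roughly as follows. This sketch assumes `stable-baselines3` and `gymnasium` are installed. `sac_kwargs` and `train_pendulum` are illustrative names, and the default step budget is taken from the rough figure quoted above, not from a guarantee.

```python
def sac_kwargs(action_dim: int = 1) -> dict:
    """Keyword arguments mirroring the hyperparameter table."""
    return {
        "learning_rate": 3e-4,
        "gamma": 0.99,
        "tau": 0.005,
        "buffer_size": 100_000,
        "batch_size": 256,
        "learning_starts": 1_000,  # warmup steps before gradient updates
        "ent_coef": "auto",        # learn alpha instead of fixing it
        "target_entropy": -float(action_dim),
    }


def train_pendulum(total_timesteps: int = 50_000):
    # Imported here so the config above can be inspected without the libraries.
    from stable_baselines3 import SAC
    from stable_baselines3.common.evaluation import evaluate_policy

    model = SAC("MlpPolicy", "Pendulum-v1", verbose=1, **sac_kwargs())
    model.learn(total_timesteps=total_timesteps)
    # deterministic=True evaluates with the policy mean instead of sampling.
    mean_reward, std_reward = evaluate_policy(
        model, model.get_env(), n_eval_episodes=10, deterministic=True
    )
    return model, mean_reward, std_reward
```

Calling `train_pendulum()` runs the full training and evaluation. Results vary with the random seed, so treat a single run as indicative, not conclusive.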

SAC vs DDPG / TD3 / PPO

	SAC	TD3	PPO
Policy	Stochastic	Deterministic	Stochastic
Off-policy	Yes	Yes	No
Exploration	Entropy + α	Noise	On-policy sampling
Tuning	Moderate	Noise, τ	Clip ε, epochs
Sample efficiency	High	High	Lower

Use SAC when environment interactions are expensive and you want to reuse data off-policy; use PPO when simplicity and on-policy stability matter more than sample count.

Common mistakes

Mistake	Symptom	Fix
Wrong log_prob (no tanh fix)	Biased actor	Subtract the tanh Jacobian term; check against the SB3 / CleanRL reference
target_entropy = 0 on Pendulum	Too greedy	Set −1 for 1D action
No warmup	Early garbage gradients	Random actions first
Huge α fixed	Never converges	Auto-tune α
Eval with stochastic policy	Noisy scores	Use mean or deterministic eval mode

Closing

SAC combines entropy-regularized objectives with off-policy actor–critic engineering. It is the default for many continuous benchmarks and your Pendulum project. Next: bridging sim to real when policies trained in MuJoCo must run on hardware.
