Module 8 — Continuous control & robotics

Project: SAC on Pendulum

Train SAC on Pendulum-v1; compare entropy and learning curves to DDPG.

~130 min read + exercises

Before we begin

Train Soft Actor–Critic (SAC) on Pendulum-v1, a continuous-action task. You will use a stochastic policy that outputs torques in [-2, 2], optimized with the entropy-regularized actor–critic updates from Module 8.
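
To make the bounded torque output concrete, here is a minimal sketch of a tanh-squashed Gaussian sample for a one-dimensional action. The function name `sample_squashed_action` and the constant `MAX_TORQUE` are illustrative assumptions, not part of any library, and SB3's internal implementation may differ in detail (for example in where it rescales actions).

```python
import math
import random

MAX_TORQUE = 2.0  # Pendulum-v1 action bound (assumed constant name)


def sample_squashed_action(mean, log_std, rng):
    """Sample a torque from a tanh-squashed Gaussian; return (action, log_prob)."""
    std = math.exp(log_std)
    u = rng.gauss(mean, std)        # unbounded pre-squash sample
    squashed = math.tanh(u)         # mapped into (-1, 1)
    action = MAX_TORQUE * squashed  # rescaled to the torque range

    # Log-density of u under the Gaussian.
    log_prob_u = (-0.5 * ((u - mean) / std) ** 2
                  - log_std - 0.5 * math.log(2.0 * math.pi))
    # Change of variables: da/du = MAX_TORQUE * (1 - tanh(u)^2).
    # The small epsilon guards against log(0) when tanh saturates.
    log_prob = log_prob_u - math.log(MAX_TORQUE * (1.0 - squashed ** 2) + 1e-6)
    return action, log_prob


rng = random.Random(0)
action, log_prob = sample_squashed_action(mean=0.0, log_std=0.0, rng=rng)
print(action, log_prob)
```

The log-probability is what the entropy term in SAC is built from, which is why the squashing correction matters.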


How this connects to Module 8

Lesson              Where you use it
Continuous actions  Gaussian / tanh-squashed policy
DDPG                Contrast: deterministic actor
SAC                 Twin Q critics, entropy bonus, off-policy
Sim-to-real         Pendulum as minimal torque-control proxy
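
The "twin Q critics, entropy bonus" entry can be illustrated with the critic target SAC regresses toward. This is a scalar sketch under stated assumptions: `soft_td_target` is a hypothetical helper name, the inputs stand in for network outputs, and `alpha` is the entropy coefficient. DDPG's target, by contrast, uses a single critic, a deterministic next action, and no entropy term.

```python
def soft_td_target(reward, done, q1_next, q2_next, log_prob_next,
                   alpha, gamma=0.99):
    """Soft Bellman target for one transition (scalar sketch).

    q1_next, q2_next: target-critic values at (s', a'), with a' sampled
                      from the current policy.
    log_prob_next:    log pi(a' | s').
    alpha:            entropy coefficient (temperature).
    """
    # Taking the minimum of the twin critics reduces overestimation bias.
    min_q = min(q1_next, q2_next)
    # Entropy bonus: unlikely actions (low log-prob) raise the soft value.
    soft_value = min_q - alpha * log_prob_next
    # No bootstrapping past a true terminal state.
    return reward + gamma * (1.0 - float(done)) * soft_value


print(soft_td_target(-1.0, False, -10.0, -12.0, -0.5, alpha=0.2))
```

A larger `alpha` weights the entropy bonus more heavily and pushes the policy toward more exploration.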

What you will build

Piece                        Purpose
SAC training on Pendulum-v1  Stable-Baselines3 or custom
Learning curve               Episode return (negative cost) vs step
Optional GIF                 Pendulum swing-up behavior
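
Raw episode returns are noisy, so a learning curve is usually easier to read after smoothing. The following is one possible sketch of a trailing moving average; `moving_average` is a hypothetical helper, and how you collect per-episode returns during training is left to your setup.

```python
def moving_average(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    if window <= 0:
        raise ValueError("window must be positive")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


print(moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```

The smoothed list can be passed to matplotlib's `plt.plot` against the step count at which each episode ended.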

Estimated time: 3–5 hours with SB3.


Before you start

  • Finish the Module 8 quiz.
  • pip install gymnasium stable-baselines3 matplotlib

Step 1 — Train SAC

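python
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,    # Adam step size
    buffer_size=100_000,   # replay buffer capacity, in transitions
    learning_starts=1000,  # steps collected before the first gradient update
    batch_size=256,        # transitions sampled per update
    tau=0.005,             # Polyak averaging rate for the target critics
    gamma=0.99,            # discount factor
    verbose=1,
)

model.learn(total_timesteps=100_000)
model.save("sac_pendulum")  # writes sac_pendulum.zip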

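Pendulum rewards are never positive, so returns closer to 0 are better. The per-step maximum of 0 is reached only when the pendulum is upright, motionless, and no torque is applied.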
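
For reference, the reward can be written out directly. This sketch follows the cost documented for Gymnasium's Pendulum-v1 (angle measured from upright and wrapped to [-π, π]); the function names are illustrative.

```python
import math


def angle_normalize(theta):
    """Wrap an angle to [-pi, pi)."""
    return ((theta + math.pi) % (2.0 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    """Per-step reward: negative quadratic cost on angle, speed, and torque."""
    return -(angle_normalize(theta) ** 2
             + 0.1 * theta_dot ** 2
             + 0.001 * torque ** 2)


best = pendulum_reward(0.0, 0.0, 0.0)       # upright, still, no torque
worst = pendulum_reward(math.pi, 8.0, 2.0)  # hanging down, max speed and torque
print(best, worst)
```

With angular speed capped at 8 and torque at 2, the worst per-step reward is about −16.27, so a 200-step episode return lies between roughly −3255 and 0.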


Step 2 — Evaluate

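python
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")
model = SAC.load("sac_pendulum")

returns = []
for seed in range(20):
    obs, _ = env.reset(seed=seed)  # one fixed seed per episode keeps runs comparable
    total = 0.0
    done = False
    while not done:
        # deterministic=True acts without sampling noise
        action, _ = model.predict(obs, deterministic=True)
        obs, r, term, trunc, _ = env.step(action)
        total += float(r)
        done = term or trunc  # Pendulum-v1 ends by truncation after 200 steps
    returns.append(total)
print("Mean return:", sum(returns) / len(returns))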

Step 3 — Compare to random baseline

A policy that applies uniformly random torque should yield a mean return of roughly −1200 to −1500 per episode. Trained SAC should reach a mean return above −200 (often above −150) within 100k steps.
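
If you want a rough feel for the random baseline without installing anything, the sketch below re-implements the pendulum dynamics in plain Python. The constants (g = 10, m = 1, l = 1, dt = 0.05, speed limit 8, 200 steps) are assumptions based on the Gymnasium documentation for Pendulum-v1, so treat this as a stand-in rather than the real environment.

```python
import math
import random

# Assumed Pendulum-v1 constants (see lead-in).
G, M, L, DT = 10.0, 1.0, 1.0, 0.05
MAX_SPEED, MAX_TORQUE, EPISODE_STEPS = 8.0, 2.0, 200


def random_policy_return(rng):
    """Return of one episode under uniformly random torque."""
    theta = rng.uniform(-math.pi, math.pi)  # angle from upright
    theta_dot = rng.uniform(-1.0, 1.0)
    total = 0.0
    for _ in range(EPISODE_STEPS):
        torque = rng.uniform(-MAX_TORQUE, MAX_TORQUE)
        angle = ((theta + math.pi) % (2.0 * math.pi)) - math.pi
        total -= angle ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2
        # Semi-implicit Euler step: update speed first, then angle.
        accel = 3.0 * G / (2.0 * L) * math.sin(theta) + 3.0 / (M * L ** 2) * torque
        theta_dot = max(-MAX_SPEED, min(MAX_SPEED, theta_dot + accel * DT))
        theta += theta_dot * DT
    return total


rng = random.Random(0)
returns = [random_policy_return(rng) for _ in range(20)]
print("Random-policy mean return:", sum(returns) / len(returns))
```

On the real environment, the same baseline comes from reusing the Step 2 loop with `env.action_space.sample()` in place of `model.predict(...)`.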


Success criteria

Criterion                                 Target
SAC trains without NaNs                   Required
Mean eval return > −300 over 20 episodes  Minimum
Mean eval return > −200                   Strong
README mentions entropy coefficient role  Recommended
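
The numeric criteria in the table can be checked mechanically. Here is a small sketch that takes a list of evaluation returns; `grade_returns` and its tier labels are hypothetical names chosen for illustration.

```python
import math
from statistics import mean, stdev


def grade_returns(returns):
    """Return (mean, std, tier) for a list of evaluation episode returns."""
    if any(math.isnan(r) for r in returns):
        return float("nan"), float("nan"), "failed: NaN return"
    avg = mean(returns)
    spread = stdev(returns) if len(returns) > 1 else 0.0
    if avg > -200:
        tier = "strong"
    elif avg > -300:
        tier = "minimum"
    else:
        tier = "below target"
    return avg, spread, tier


print(grade_returns([-150.0, -170.0, -140.0]))
```

Reporting the standard deviation alongside the mean helps show whether a few bad episodes are dragging the average down.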

Extension ideas

  • Train DDPG on the same environment and compare sample efficiency.
  • Plot a heatmap of the policy's actions over (θ, θ̇).
  • Add noise to observations (sim-to-real teaser).
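
For the observation-noise idea, one possible starting point is a thin wrapper around any environment that follows the Gymnasium `reset`/`step` return convention. `NoisyObservation` is a hypothetical class written for this sketch; it does not expose observation or action spaces, so for training with SB3 you would more likely subclass `gymnasium.ObservationWrapper` instead.

```python
import random


class NoisyObservation:
    """Adds zero-mean Gaussian noise to every observation (illustrative sketch)."""

    def __init__(self, env, sigma=0.05, seed=0):
        self.env = env
        self.sigma = sigma              # noise standard deviation
        self.rng = random.Random(seed)  # separate RNG so noise is reproducible

    def _perturb(self, obs):
        return [float(x) + self.rng.gauss(0.0, self.sigma) for x in obs]

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._perturb(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Only the observation is corrupted; reward and flags pass through.
        return self._perturb(obs), reward, terminated, truncated, info
```

Evaluating a trained policy at several values of `sigma` gives a simple robustness curve: mean return as a function of sensor noise.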

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.