Project: SAC on Pendulum
Before we begin
Train Soft Actor–Critic (SAC) on Pendulum-v1, a continuous-action task. You will use a stochastic policy that outputs torques in [-2, 2], optimized with the entropy-regularized actor–critic updates from Module 8.
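To make the torque range concrete, here is a minimal sketch of how a tanh-squashed Gaussian policy can turn an unbounded sample into a torque within [-2, 2]. The helper name `sample_torque` and the fixed `mean` and `log_std` values are illustrative assumptions; in a real SAC actor they would be outputs of the policy network.

```python
import math
import random

MAX_TORQUE = 2.0  # Pendulum-v1 torque bound


def sample_torque(mean, log_std, rng):
    """Sample one torque from a tanh-squashed Gaussian (hypothetical helper)."""
    std = math.exp(log_std)
    u = rng.gauss(mean, std)          # unbounded pre-squash sample
    return MAX_TORQUE * math.tanh(u)  # squash to [-1, 1], then scale to the torque range


rng = random.Random(0)
torques = [sample_torque(mean=0.5, log_std=0.0, rng=rng) for _ in range(5)]
print(torques)
```

Squashing with tanh keeps every sampled action inside the valid range, so no clipping is needed at the environment boundary.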
How this connects to Module 8
| Lesson | Where you use it |
|---|---|
| Continuous actions | Gaussian / tanh-squashed policy |
| DDPG | Contrast: deterministic actor |
| SAC | Twin Q critics, entropy bonus, off-policy |
| Sim-to-real | Pendulum as minimal torque-control proxy |
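The twin Q critics and entropy bonus listed in the table meet in the critic's bootstrap target. The scalar sketch below shows one common form of that target for a single transition; the function name `sac_target` and the fixed `alpha=0.2` are assumptions for illustration (Stable-Baselines3 tunes the entropy coefficient automatically by default).

```python
def sac_target(reward, done, q1_next, q2_next, log_prob_next, gamma=0.99, alpha=0.2):
    """Soft Bellman target for one transition (scalar sketch, not library code)."""
    # Twin critics: take the smaller estimate to curb overestimation.
    min_q = min(q1_next, q2_next)
    # Entropy bonus: subtracting alpha * log pi(a'|s') rewards less predictable actions.
    soft_value = min_q - alpha * log_prob_next
    # Do not bootstrap past a terminal state.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * soft_value


print(sac_target(reward=-1.0, done=False, q1_next=5.0, q2_next=4.0, log_prob_next=-1.0))
```

A deterministic actor such as DDPG has no `log_prob_next` term, which is the contrast the DDPG row refers to.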
What you will build
| Piece | Details |
|---|---|
| SAC training on Pendulum-v1 | Stable-Baselines3 or a custom implementation |
| Learning curve | Episode return (negative cost) vs. training step |
| Optional GIF | Pendulum swing-up behavior |
Estimated time: 3–5 hours with SB3.
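Raw episode returns are noisy, so the learning curve is usually easier to read after smoothing. The following is a minimal sketch of a trailing moving average; `moving_average` is a hypothetical helper and the returns in the example are made-up placeholder numbers, not results from a real run.

```python
def moving_average(values, window):
    """Trailing mean over at most the `window` most recent values."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Placeholder episode returns for illustration only.
episode_returns = [-1400.0, -1200.0, -900.0, -400.0, -200.0, -150.0]
print(moving_average(episode_returns, window=3))
```

Pendulum-v1 episodes last 200 steps, so episode k (counting from 0) ends at training step 200 × (k + 1); plotting the smoothed list against those step counts with matplotlib gives the return-vs-step curve.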
Before you start
- Finish the Module 8 quiz.
- Install the dependencies: `pip install gymnasium stable-baselines3 matplotlib`
Step 1 — Train SAC
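```python
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=100_000,   # replay buffer capacity (transitions)
    learning_starts=1000,  # collect this many steps before the first update
    batch_size=256,
    tau=0.005,             # soft (Polyak) update rate for the target critics
    gamma=0.99,            # discount factor
    verbose=1,
)
model.learn(total_timesteps=100_000)
model.save("sac_pendulum")
```

Pendulum rewards are never positive, so a return closer to 0 is better. The maximum of 0 is reached only when the pendulum balances upright.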
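To see why the reward can never exceed 0, it helps to write out the per-step cost. The sketch below assumes the reward documented for Gymnasium's Pendulum-v1, a negated quadratic in angle, angular velocity, and torque; the function names are illustrative and this is not the environment's own code.

```python
import math


def angle_normalize(theta):
    """Wrap an angle to [-pi, pi), with 0 meaning upright."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    """Per-step reward: negative cost on angle, angular velocity, and torque."""
    cost = angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2
    return -cost


print(pendulum_reward(0.0, 0.0, 0.0))      # upright, at rest, no torque
print(pendulum_reward(math.pi, 8.0, 2.0))  # hanging down at max speed and torque
```

Under these assumptions the worst single step costs about 16.27, so a 200-step episode return is bounded below by roughly −3255 and above by 0.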
Step 2 — Evaluate
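```python
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")
model = SAC.load("sac_pendulum")

returns = []
for seed in range(20):
    obs, _ = env.reset(seed=seed)  # fixed seeds make the evaluation repeatable
    total = 0.0
    done = False
    while not done:
        # deterministic=True uses the policy's mean action instead of sampling
        action, _ = model.predict(obs, deterministic=True)
        obs, r, term, trunc, _ = env.step(action)
        total += float(r)
        done = term or trunc
    returns.append(total)

print("Mean return:", sum(returns) / len(returns))
```

Step 3 — Compare to random baseline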
A uniformly random torque policy typically yields a mean return of roughly −1200 to −1500. A trained SAC agent should exceed −200 (often −150) within 100k steps.
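One way to measure the random baseline is to reuse the evaluation loop from Step 2 with sampled actions in place of the model's predictions. The sketch below assumes a Gymnasium-style environment (`reset`, `step`, `action_space.sample`); `mean_random_return` is a hypothetical helper name.

```python
def mean_random_return(env, episodes=20):
    """Average undiscounted return of uniformly random actions (sketch)."""
    returns = []
    for seed in range(episodes):
        obs, _ = env.reset(seed=seed)
        total, done = 0.0, False
        while not done:
            action = env.action_space.sample()  # random torque from the action space
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)
```

Call it as `mean_random_return(gym.make("Pendulum-v1"))`. Seeding the action space first with `env.action_space.seed(0)` makes the sampled torques repeatable across runs.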
Success criteria
| Criterion | Target |
|---|---|
| SAC trains without NaNs | Required |
| Mean eval return > −300 over 20 episodes | Minimum |
| Mean eval return > −200 | Strong |
| README mentions entropy coefficient role | Recommended |
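For the README note on the entropy coefficient, the sketch below shows one common way the coefficient α is tuned automatically: a gradient step on log α that raises α when the policy's entropy falls below a target and lowers it otherwise. The function name, the plain gradient step, and the target of −1 (the negative action dimension, a common heuristic for a 1-D action) are assumptions for illustration, not a description of any particular library's internals.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on the temperature loss (sketch).

    Loss: -log_alpha * mean(log_prob + target_entropy)
    """
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return log_alpha - lr * grad


TARGET_ENTROPY = -1.0  # assumed heuristic: -dim(action space) for Pendulum

# Mean log-prob 0.5 means entropy -0.5, above the target, so alpha should shrink.
new_log_alpha = update_log_alpha(0.0, [0.5, 0.5], TARGET_ENTROPY, lr=0.1)
print(math.exp(new_log_alpha))
```

A larger α weights the entropy bonus more heavily and keeps the policy exploratory; a smaller α lets the policy become nearly deterministic once it has found good torques.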
Extension ideas
- Train DDPG on the same environment and compare sample efficiency.
- Plot a heatmap of the policy's chosen action over the state space (θ, θ̇).
- Add domain noise to observations (sim-to-real teaser).
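For the observation-noise extension, a minimal sketch of the perturbation itself is shown below. It assumes the Pendulum-v1 observation layout [cos θ, sin θ, θ̇] and independent Gaussian noise on each component; `noisy_observation` and the value of `sigma` are illustrative choices.

```python
import random


def noisy_observation(obs, sigma, rng):
    """Return a copy of `obs` with zero-mean Gaussian noise on each component."""
    return [x + rng.gauss(0.0, sigma) for x in obs]


rng = random.Random(0)
clean = [1.0, 0.0, 0.0]  # upright and at rest: cos θ = 1, sin θ = 0, θ̇ = 0
print(noisy_observation(clean, sigma=0.05, rng=rng))
```

In a Gymnasium setup the same logic could live in a subclass of `gymnasium.ObservationWrapper` that overrides `observation`, so the agent trains and evaluates on the noisy signal.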
What's next
Return to the course curriculum and continue to the next module when your project runs end-to-end.