Project: SAC on Pendulum

Before we begin

Train Soft Actor–Critic (SAC) on Pendulum-v1, a continuous-action control task. The policy outputs a torque in [-2, 2] and is optimized with the entropy-regularized actor–critic updates from Module 8.

How this connects to Module 8

Lesson	Where you use it
Continuous actions	Gaussian / tanh-squashed policy
DDPG	Contrast: deterministic actor
SAC	Twin Q critics, entropy bonus, off-policy
Sim-to-real	Pendulum as minimal torque-control proxy
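
The Gaussian / tanh-squashed policy in the first row can be sketched in a few lines. This is a minimal illustration, not Stable-Baselines3's internal code: the function name `sample_squashed_action` and the constant `MAX_TORQUE` are made up for this example, and the mean and log standard deviation would normally come from the policy network.

```python
import math
import random

MAX_TORQUE = 2.0  # Pendulum-v1 torque limit (hypothetical constant name)


def sample_squashed_action(mean, log_std, rng):
    """Sample a torque from a tanh-squashed Gaussian; return (action, log_prob)."""
    std = math.exp(log_std)
    u = rng.gauss(mean, std)          # unbounded Gaussian sample
    squashed = math.tanh(u)           # squash into (-1, 1)
    action = MAX_TORQUE * squashed    # rescale into [-2, 2]

    # Log-density of u under the Gaussian
    log_prob_u = (
        -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    )
    # Change of variables: da/du = MAX_TORQUE * (1 - tanh(u)^2).
    # The small epsilon keeps the log finite when tanh saturates.
    log_det = math.log(MAX_TORQUE) + math.log(1.0 - squashed ** 2 + 1e-6)
    return action, log_prob_u - log_det
```

The log-probability correction matters because SAC's entropy term is computed on the squashed action, not on the raw Gaussian sample. A deterministic actor such as DDPG's would skip the sampling step and output the squashed mean directly.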

What you will build

Piece	Purpose
SAC training on Pendulum-v1	Trained agent, via Stable-Baselines3 or a custom implementation
Learning curve	Episode return (negative cost) vs step
Optional GIF	Pendulum swing-up behavior

Estimated time: 3–5 hours with SB3.

Before you start

Finish the Module 8 quiz.
pip install gymnasium stable-baselines3 matplotlib

Step 1 — Train SAC

python

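import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=100_000,     # replay buffer capacity, in transitions
    learning_starts=1000,    # environment steps collected before the first update
    batch_size=256,
    tau=0.005,               # Polyak averaging coefficient for the target critics
    gamma=0.99,              # discount factor
    verbose=1,
)

model.learn(total_timesteps=100_000)
model.save("sac_pendulum")  # writes sac_pendulum.zip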
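
For the learning-curve deliverable, one option is to log episodes with SB3's `Monitor` wrapper and smooth the returns before plotting. The sketch below assumes the environment was wrapped as `Monitor(env, log_dir)` (from `stable_baselines3.common.monitor`) before the model was built, and that `log_dir` already exists. The helper names `learning_curve` and `plot_from_monitor`, the window size, and the output filename are illustrative choices.

```python
def learning_curve(episode_lengths, episode_returns, window=10):
    """Return (steps, smoothed_returns) for a return-vs-environment-step plot."""
    steps, smoothed = [], []
    total_steps = 0
    for i, length in enumerate(episode_lengths):
        total_steps += length
        # Trailing mean over at most `window` episodes
        recent = episode_returns[max(0, i - window + 1): i + 1]
        steps.append(total_steps)
        smoothed.append(sum(recent) / len(recent))
    return steps, smoothed


def plot_from_monitor(log_dir, out_path="learning_curve.png", window=10):
    """Plot the curve from Monitor logs found in log_dir."""
    # Imported lazily so learning_curve() works without these packages
    import matplotlib.pyplot as plt
    from stable_baselines3.common.results_plotter import load_results

    df = load_results(log_dir)  # DataFrame with "r" (return) and "l" (length)
    steps, smoothed = learning_curve(df["l"].tolist(), df["r"].tolist(), window)
    plt.plot(steps, smoothed)
    plt.xlabel("Environment step")
    plt.ylabel(f"Episode return ({window}-episode mean)")
    plt.savefig(out_path)
```

Pendulum episodes last 200 steps, so 100k steps give about 500 points. The trailing mean makes the trend easier to read than raw per-episode returns.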

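Pendulum rewards are never positive: each step returns a cost-based reward of at most 0, so an episode return closer to 0 is better. The maximum (≈ 0) corresponds to holding the pendulum upright with almost no torque.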
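
To make the scale of those returns concrete, here is a small re-implementation of the per-step reward as documented for Gymnasium's Pendulum-v1. It is a standalone sketch for intuition, not the environment's own code. The function names are chosen for this example.

```python
import math


def angle_normalize(theta):
    """Wrap an angle to [-pi, pi), where 0 is upright."""
    return ((theta + math.pi) % (2.0 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    """Negative quadratic cost on angle, angular velocity, and torque."""
    torque = max(-2.0, min(2.0, torque))  # the environment clips torque to [-2, 2]
    cost = (
        angle_normalize(theta) ** 2
        + 0.1 * theta_dot ** 2
        + 0.001 * torque ** 2
    )
    return -cost
```

The worst single-step reward, with the pendulum hanging straight down at the speed limit of 8 rad/s under full torque, is about −16.27. With 200 steps per episode, returns therefore lie between roughly −3255 and 0.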

Step 2 — Evaluate

python

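import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")
model = SAC.load("sac_pendulum")

returns = []
for seed in range(20):  # one fixed seed per evaluation episode
    obs, _ = env.reset(seed=seed)
    total = 0.0
    done = False
    while not done:
        # deterministic=True uses the policy's mean action instead of sampling
        action, _ = model.predict(obs, deterministic=True)
        obs, r, term, trunc, _ = env.step(action)
        total += float(r)
        done = term or trunc  # Pendulum-v1 ends by truncation after 200 steps
    returns.append(total)
print("Mean return:", sum(returns) / len(returns))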

Step 3 — Compare to random baseline

A uniformly random torque policy typically yields a mean return of roughly −1200 to −1500 per episode. A trained SAC agent should reach a mean return above −200 (often above −150) within 100k steps.
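
A random baseline can reuse the Step 2 loop with sampled actions in place of `model.predict`. The function below is a sketch. The name `random_baseline` and its default of 20 episodes are choices made here to mirror the evaluation above.

```python
def random_baseline(env, n_episodes=20):
    """Mean undiscounted return when actions are drawn uniformly at random."""
    returns = []
    for seed in range(n_episodes):
        obs, _ = env.reset(seed=seed)
        env.action_space.seed(seed)  # make the sampled actions reproducible
        total, done = 0.0, False
        while not done:
            action = env.action_space.sample()
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)


# Usage (requires gymnasium):
#   import gymnasium as gym
#   print("Random mean return:", random_baseline(gym.make("Pendulum-v1")))
```

Using the same seeds for the random and trained policies keeps the initial pendulum states identical, so the difference in mean return reflects the policy rather than luck in the starting angle.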

Success criteria

Criterion	Target
SAC trains without NaNs	Required
Mean eval return > −300 over 20 episodes	Minimum
Mean eval return > −200	Strong
README explains the role of the entropy coefficient	Recommended
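
For the README note on the entropy coefficient, it may help to state the objective explicitly. The notation below is the standard SAC formulation and is not taken from this page: α is the entropy coefficient (temperature), H is the policy entropy, and H-bar is the target entropy.

```latex
% Entropy-regularized return maximized by the SAC policy
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t}
  \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right]

% Temperature loss used when \alpha is tuned automatically
J(\alpha) = \mathbb{E}_{a_t \sim \pi}\!\left[
  -\alpha \left( \log \pi(a_t \mid s_t) + \bar{\mathcal{H}} \right) \right]
```

A larger α rewards more random torques (more exploration), while α near 0 recovers a plain actor–critic objective. The Step 1 code does not set `ent_coef`, so SB3 uses its default of `"auto"`, which learns α toward a target entropy of −dim(A) = −1 for Pendulum's one-dimensional action.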

Extension ideas

Train DDPG on the same environment and compare sample efficiency.
Plot a heatmap of the policy's action over (θ, θ̇).
Add domain noise to observations (sim-to-real teaser).

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.