
Module 6 — Actor–critic & PPO

Project: PPO on Lunar Lander

Train PPO on LunarLander-v2, log the clip fraction, and reach the solve threshold.

~140 min read + exercises


Before we begin

Train PPO on LunarLander-v2, a harder control task than CartPole. You may use Stable-Baselines3 (recommended) or implement the clipped surrogate objective yourself. Log the clip fraction and the approximate KL divergence to verify that policy updates stay conservative.
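To make the two diagnostics concrete, here is a minimal sketch of how the clipped surrogate, the clip fraction, and an approximate KL can be computed from per-sample log-probabilities. The function name ppo_clip_stats and the toy numbers are illustrative assumptions. The KL estimator used here, mean((r − 1) − log r), is one common choice, not the only one.

```python
import math

def ppo_clip_stats(new_logp, old_logp, advantages, clip_eps=0.2):
    """Return (surrogate objective, clip fraction, approx KL) for one minibatch.

    Illustrative helper (hypothetical name): inputs are plain lists of floats.
    """
    n = len(new_logp)
    objective, clipped, approx_kl = 0.0, 0, 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        log_ratio = lp_new - lp_old
        ratio = math.exp(log_ratio)  # r_t(theta) = pi_new(a|s) / pi_old(a|s)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the unclipped and clipped terms
        objective += min(ratio * adv, clipped_ratio * adv)
        # A sample counts as clipped when its ratio leaves [1 - eps, 1 + eps]
        clipped += abs(ratio - 1.0) > clip_eps
        # Low-variance KL estimate between the old and new policy
        approx_kl += (ratio - 1.0) - log_ratio
    return objective / n, clipped / n, approx_kl / n

# Toy minibatch: the first action became more likely, the second is unchanged
old = [math.log(0.5), math.log(0.5)]
new = [math.log(0.8), math.log(0.5)]
obj, clip_frac, kl = ppo_clip_stats(new, old, advantages=[1.0, -1.0])
print(f"objective={obj:.3f} clip_fraction={clip_frac:.2f} approx_kl={kl:.3f}")
```

A clip fraction near zero suggests the updates barely move the policy, while a persistently high value suggests the step size or number of epochs is too aggressive.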


How this connects to Module 6

Lesson | Where you use it
GAE | Advantage estimates for policy gradient
TRPO / PPO | Trust region via clipped ratio rₜ(θ)
A2C & parallel RL | Vectorized envs speed collection
PPO hyperparameters | clip ε, n_steps, learning rate
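As a reminder of the GAE row, the backward recursion for advantage estimates can be sketched as follows. This is a minimal version under stated assumptions: the function name gae is hypothetical, inputs are plain lists for a single environment, and dones[t] is True when the episode ended after step t.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (illustrative sketch).

    values[t] is the critic's estimate V(s_t); last_value is V of the state
    that follows the final step of the rollout.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # Exponentially weighted sum of future TD errors, reset at episode ends
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# With gamma = lam = 1 and a zero critic, GAE reduces to the reward-to-go
print(gae([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], 0.0, [False, False, False],
          gamma=1.0, lam=1.0))
```

Setting lam closer to 0 leans on the critic (lower variance, more bias), and closer to 1 leans on observed returns.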

What you will build

Piece | Purpose
LunarLander training run | PPO with MlpPolicy
TensorBoard or CSV logs | Reward, ep_len, clip_fraction
lander.gif or screenshot | Best landing trajectory

Estimated time: 3–5 hours (SB3) or 8+ hours (from scratch).


Before you start

  • Finish the Module 6 quiz.
  • pip install "gymnasium[box2d]" stable-baselines3 tensorboard

LunarLander requires the Box2D dependency. On some systems, Box2D will not build until SWIG is installed, so run pip install swig first.


Step 1 — Baseline with Stable-Baselines3

python
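import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor

# Newer Gymnasium releases (1.0+) register this task as "LunarLander-v3";
# use that ID if "LunarLander-v2" is rejected.
env = gym.make("LunarLander-v2")
env = Monitor(env)  # records per-episode return and length

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,      # transitions collected per env before each update
    batch_size=64,     # minibatch size for each gradient step
    n_epochs=10,       # passes over each rollout
    gamma=0.99,        # discount factor
    clip_range=0.2,    # clip ε for the probability ratio
    verbose=1,
    tensorboard_log="./tb_logs/",  # view with: tensorboard --logdir ./tb_logs/
)

model.learn(total_timesteps=500_000)
model.save("ppo_lunar_lander")  # writes ppo_lunar_lander.zip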
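
It can help to see how these hyperparameters translate into an update budget. The arithmetic below is a sketch: the helper name ppo_update_budget is hypothetical, and it assumes training proceeds in whole rollouts of n_steps × n_envs transitions, each split into minibatches of batch_size.

```python
import math

def ppo_update_budget(total_timesteps, n_steps, batch_size, n_epochs, n_envs=1):
    """Rough bookkeeping for a PPO run (illustrative helper, not an SB3 API)."""
    rollout_size = n_steps * n_envs                   # transitions per policy update
    minibatches = math.ceil(rollout_size / batch_size)
    grad_steps_per_update = minibatches * n_epochs    # optimizer steps per rollout
    n_updates = math.ceil(total_timesteps / rollout_size)
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches,
        "grad_steps_per_update": grad_steps_per_update,
        "n_updates": n_updates,
        "total_grad_steps": n_updates * grad_steps_per_update,
    }

# Values from the baseline above: one env, 2048 steps, batch 64, 10 epochs
budget = ppo_update_budget(500_000, n_steps=2048, batch_size=64, n_epochs=10)
print(budget)
# rollout_size=2048, minibatches_per_epoch=32, grad_steps_per_update=320,
# n_updates=245, total_grad_steps=78400
```

With 8 parallel envs and the same n_steps, each rollout grows to 16,384 transitions, so the same timestep budget yields far fewer, larger updates. That is one reason n_steps is often reduced when environments are vectorized.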

Step 2 — Evaluate

python
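import gymnasium as gym
from stable_baselines3 import PPO

# Fresh evaluation env, separate from the training env (same ID as in training)
env = gym.make("LunarLander-v2")
model = PPO.load("ppo_lunar_lander")

obs, _ = env.reset(seed=42)
total = 0.0
for _ in range(1000):  # upper bound on steps for one episode
    # deterministic=True takes the most likely action instead of sampling
    action, _ = model.predict(obs, deterministic=True)
    obs, r, term, trunc, _ = env.step(action)
    total += r
    if term or trunc:  # crashed, landed, or hit the time limit
        break
print("Eval return:", total)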

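The loop above runs a single episode. For the project, run 20 evaluation episodes with deterministic=True and report the mean and standard deviation of the episode returns.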
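
One way to collect those 20 returns is sketched below. The helper evaluate_returns is a hypothetical name. It assumes an env with the Gymnasium reset/step interface and a predict callable that maps an observation to an action, and it reports the population standard deviation.

```python
import statistics

def evaluate_returns(env, predict, n_episodes=20, seed=0):
    """Run full episodes and return (mean, std, per-episode returns).

    Illustrative helper: `env` follows the Gymnasium API and `predict(obs)`
    returns an action for a single observation.
    """
    returns = []
    for episode in range(n_episodes):
        # A different seed per episode gives varied but reproducible starts
        obs, _ = env.reset(seed=seed + episode)
        total, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(predict(obs))
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return statistics.mean(returns), statistics.pstdev(returns), returns
```

With the model and env from the evaluation loop, a call could look like evaluate_returns(env, lambda obs: model.predict(obs, deterministic=True)[0]). Stable-Baselines3 also ships evaluate_policy in stable_baselines3.common.evaluation, which returns the mean and standard deviation directly.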


Step 3 — Record best landing (optional)

python
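# pip install imageio
import os
import gymnasium as gym
import imageio
from stable_baselines3 import PPO

# render() only returns image frames when render_mode="rgb_array" is set
env = gym.make("LunarLander-v2", render_mode="rgb_array")
model = PPO.load("ppo_lunar_lander")

frames = []
obs, _ = env.reset(seed=0)
for _ in range(1000):
    frames.append(env.render())
    action, _ = model.predict(obs, deterministic=True)
    obs, _, term, trunc, _ = env.step(action)
    if term or trunc:
        break

os.makedirs("outputs", exist_ok=True)  # make sure the target folder exists
# Some imageio releases expect duration (per-frame time) instead of fps
imageio.mimsave("outputs/lander.gif", frames, fps=30)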

Success criteria

Criterion | Target
Agent lands without crashing in most eval episodes | Required
Mean eval return ≥ 200 over 20 episodes | LunarLander "solved" territory
Training logs saved (TB or CSV) | Required
README notes one hyperparameter you changed | Required
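
For the CSV logging criterion, one option is to give the Monitor wrapper a file path, for example Monitor(env, filename="./logs/ppo_lander") (the path is a placeholder). The sketch below parses such a log with the standard library. It assumes the layout is one metadata line starting with #, followed by a header row with r (episode return), l (episode length), and t (elapsed time). The helper names are hypothetical.

```python
import csv

def read_monitor_csv(text):
    """Parse Monitor-style CSV text into a list of (return, length) tuples."""
    # Drop the leading '#{...}' metadata line; the rest is ordinary CSV
    body = [line for line in text.splitlines() if not line.startswith("#")]
    return [(float(row["r"]), int(row["l"])) for row in csv.DictReader(body)]

def moving_average(values, window):
    """Trailing mean over up to `window` most recent values."""
    out, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # slide the window forward
        out.append(running / min(i + 1, window))
    return out

# Tiny made-up log in the assumed layout (not real training output)
sample = (
    '#{"t_start": 0.0, "env_id": "LunarLander-v2"}\n'
    "r,l,t\n"
    "-120.5,98,1.2\n"
    "35.0,250,3.4\n"
    "210.25,310,6.0\n"
)
episodes = read_monitor_csv(sample)
smoothed = moving_average([ret for ret, _ in episodes], window=2)
print(episodes, smoothed)
```

A moving average over roughly 100 episodes makes it easier to see when the return curve crosses the 200 threshold.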

Extension ideas

  • Run 4–8 parallel envs via SubprocVecEnv for faster sampling.
  • Implement the clipped objective manually on CartPole first, then transfer it to LunarLander.
  • Compare the sample efficiency of PPO and A2C.

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.