Project: PPO on Lunar Lander
Before we begin
Train PPO on LunarLander-v2, a harder control task than CartPole. You may use Stable-Baselines3 (recommended) or implement the clipped surrogate objective yourself. Log the clip fraction and the approximate KL divergence to verify that updates stay conservative.
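To make the two diagnostics concrete, here is a minimal pure-Python sketch of the clipped surrogate, the clip fraction, and an approximate KL estimate for one batch. The function name `ppo_diagnostics` and its list-based inputs are illustrative assumptions, not part of any library. The KL estimator used here, mean of (r - 1) - log r, is one common low-variance choice; other estimators exist.

```python
import math


def ppo_diagnostics(new_logp, old_logp, advantages, clip_eps=0.2):
    """Return (clipped surrogate, clip fraction, approximate KL) for one batch.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. All inputs are plain lists.
    """
    n = len(new_logp)
    surrogate = 0.0
    clipped = 0
    approx_kl = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        log_ratio = lp_new - lp_old
        ratio = math.exp(log_ratio)  # probability ratio r_t(theta)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two objectives.
        surrogate += min(ratio * adv, clipped_ratio * adv)
        # Count samples whose ratio left the [1 - eps, 1 + eps] band.
        if abs(ratio - 1.0) > clip_eps:
            clipped += 1
        # Low-variance KL estimate: (r - 1) - log r, always >= 0.
        approx_kl += (ratio - 1.0) - log_ratio
    return surrogate / n, clipped / n, approx_kl / n
```

When the new and old policies are identical, every ratio is 1, so the clip fraction and the approximate KL are both 0. Values that grow steadily over training suggest the updates are becoming less conservative.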
How this connects to Module 6
| Lesson | Where you use it |
|---|---|
| GAE | Advantage estimates for policy gradient |
| TRPO / PPO | Trust region via clipped ratio rₜ(θ) |
| A2C & parallel RL | Vectorized envs speed collection |
| PPO hyperparameters | clip ε, n_steps, learning rate |
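As a refresher on the GAE row, the sketch below computes advantage estimates over one rollout in plain Python. The name `compute_gae` is hypothetical, and `lam=0.95` is an assumed default rather than a value taken from this project. For simplicity it treats every `done` flag as a true termination and does not bootstrap across it.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones: lists of equal length T.
    last_value: value estimate for the state after the final step.
    dones[t] is True when the episode ended at step t.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated estimate.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; the bootstrap term vanishes at episode end.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With `lam=0` this reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline.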
What you will build
| Piece | Purpose |
|---|---|
| LunarLander training run | PPO with MlpPolicy |
| TensorBoard or CSV logs | Reward, ep_len, clip_fraction |
| lander.gif or screenshot | Best landing trajectory |
Estimated time: 3–5 hours (SB3) or 8+ hours (from scratch).
Before you start
- Finish the Module 6 quiz.
- Install the dependencies: pip install "gymnasium[box2d]" stable-baselines3 tensorboard
- LunarLander requires the Box2D dependency. On some systems you need to run pip install swig first.
Step 1 — Baseline with Stable-Baselines3
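```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor

# Gymnasium 1.0+ registers this task as "LunarLander-v3"; switch the ID
# if "LunarLander-v2" raises a deprecation error on your install.
env = gym.make("LunarLander-v2")
env = Monitor(env)  # records episode reward and length

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,    # rollout length per update
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    clip_range=0.2,  # clip epsilon
    verbose=1,
    tensorboard_log="./tb_logs/",
)
model.learn(total_timesteps=500_000)
model.save("ppo_lunar_lander")
```

Step 2 — Evaluate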
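```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("LunarLander-v2")
model = PPO.load("ppo_lunar_lander")

# Roll out one deterministic episode and sum the rewards.
obs, _ = env.reset(seed=42)
total = 0.0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, r, term, trunc, _ = env.step(action)
    total += r
    if term or trunc:
        break
print("Eval return:", total)
```

Run 20 eval episodes with deterministic=True. Report the mean and standard deviation of the return.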
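One way to structure the 20-episode evaluation is sketched below. The helper name `evaluate` and the `base_seed` argument are assumptions for illustration. It only relies on the `reset`/`step` and `predict` calls already used in this step, and it reports the population standard deviation (the same convention as NumPy's default).

```python
import statistics


def evaluate(model, env, n_episodes=20, max_steps=1000, base_seed=42):
    """Run deterministic episodes; return (mean, std, per-episode returns)."""
    returns = []
    for ep in range(n_episodes):
        # A different seed per episode gives varied starting conditions.
        obs, _ = env.reset(seed=base_seed + ep)
        total = 0.0
        for _ in range(max_steps):
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            if terminated or truncated:
                break
        returns.append(total)
    return statistics.mean(returns), statistics.pstdev(returns), returns
```

Typical usage would be `mean, std, _ = evaluate(model, env)` followed by a print of both numbers. Stable-Baselines3 also ships `evaluate_policy` in `stable_baselines3.common.evaluation`, which returns a mean and standard deviation and can be used instead.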
Step 3 — Record best landing (optional)
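```python
# pip install imageio
import os

import gymnasium as gym
import imageio

# render() only returns image frames when render_mode="rgb_array".
env = gym.make("LunarLander-v2", render_mode="rgb_array")
os.makedirs("outputs", exist_ok=True)

frames = []
obs, _ = env.reset(seed=0)
for _ in range(1000):
    frames.append(env.render())
    action, _ = model.predict(obs, deterministic=True)
    obs, _, term, trunc, _ = env.step(action)
    if term or trunc:
        break
# Some imageio releases may reject fps for GIFs and ask for duration instead.
imageio.mimsave("outputs/lander.gif", frames, fps=30)
```

Success criteria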
| Criterion | Target |
|---|---|
| Agent lands without crashing in most eval episodes | Required |
| Mean eval return ≥ 200 over 20 episodes | LunarLander "solved" territory |
| Training logs saved (TB or CSV) | Required |
| README notes one hyperparameter you changed | Required |
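If you keep CSV logs, a short script can summarize whether updates stayed conservative. The sketch below assumes a `progress.csv` written by Stable-Baselines3's CSV logger (for example after `model.set_logger(configure("./logs", ["stdout", "csv"]))`) with the columns `train/approx_kl` and `train/clip_fraction`. The helper name and both thresholds are illustrative assumptions, not recommended values.

```python
import csv
import statistics


def summarize_ppo_log(csv_path, kl_limit=0.03, clip_limit=0.3):
    """Summarize approx. KL and clip fraction from a CSV training log.

    kl_limit and clip_limit are hypothetical thresholds for flagging
    aggressive updates; tune them for your own runs.
    """
    kls, clips = [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            kl = row.get("train/approx_kl", "")
            cf = row.get("train/clip_fraction", "")
            # Rows logged before the first update have empty train/ cells.
            if kl and cf:
                kls.append(float(kl))
                clips.append(float(cf))
    if not kls:
        return None
    over = sum(1 for k, c in zip(kls, clips) if k > kl_limit or c > clip_limit)
    return {
        "updates": len(kls),
        "mean_approx_kl": statistics.mean(kls),
        "max_approx_kl": max(kls),
        "mean_clip_fraction": statistics.mean(clips),
        "updates_over_limit": over,
    }
```

A result with few or no updates over the limits is one piece of evidence for the README that the clipped objective kept the policy close to the data-collecting policy.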
Extension ideas
- 4–8 parallel envs via SubprocVecEnv for faster sampling.
- Implement clipped objective manually on CartPole first, then transfer to LunarLander.
- Compare PPO vs A2C sample efficiency.
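For the parallel-envs idea, a possible starting point is sketched below. The helper names are hypothetical. It assumes that `n_steps` counts steps per environment, so 4 envs with `n_steps=512` collect the same 2048 transitions per update as the single-env baseline. The divisibility check is a choice made in this sketch to keep minibatches evenly sized, not a library requirement.

```python
def rollout_size(n_steps, n_envs):
    """Transitions collected per update when n_steps is counted per env."""
    return n_steps * n_envs


def make_parallel_ppo(env_id="LunarLander-v2", n_envs=4, n_steps=512, batch_size=64):
    """Build PPO on top of n_envs subprocess environments."""
    # Imported lazily so rollout_size() stays usable without SB3 installed.
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.vec_env import SubprocVecEnv

    if rollout_size(n_steps, n_envs) % batch_size != 0:
        raise ValueError("n_steps * n_envs should be a multiple of batch_size")
    # make_vec_env wraps each copy in a Monitor, so episode stats are logged.
    vec_env = make_vec_env(env_id, n_envs=n_envs, vec_env_cls=SubprocVecEnv)
    return PPO(
        "MlpPolicy",
        vec_env,
        n_steps=n_steps,
        batch_size=batch_size,
        verbose=1,
    )
```

Because SubprocVecEnv starts worker processes, call `make_parallel_ppo()` from inside an `if __name__ == "__main__":` block and then run `model.learn(...)` as in Step 1.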
What's next
Return to the course curriculum and continue to the next module when your project runs end-to-end.