Project: PPO on Lunar Lander

Before we begin

Train PPO on LunarLander-v2, a harder control task than CartPole. You may use Stable-Baselines3 (recommended) or implement the clipped surrogate objective yourself. Log the clip fraction and the approximate KL divergence to verify that policy updates stay conservative.
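
To make the two diagnostics concrete, here is a minimal sketch of the clipped surrogate loss together with clip fraction and approximate KL, written in plain Python so it runs without any RL library. The function name `ppo_clip_stats` and its arguments are hypothetical. The KL estimator `(r - 1) - log r` is one common choice and is assumed here, not prescribed by this project.

```python
import math

def ppo_clip_stats(new_logp, old_logp, advantages, clip_eps=0.2):
    """Return (surrogate_loss, clip_fraction, approx_kl) for one batch.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. Hypothetical helper for study.
    """
    n = len(new_logp)
    loss_sum, clipped, kl_sum = 0.0, 0, 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        log_ratio = lp_new - lp_old
        ratio = math.exp(log_ratio)  # r_t(theta)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # PPO maximizes min(r * A, clip(r) * A); negate it to get a loss.
        loss_sum += -min(ratio * adv, clipped_ratio * adv)
        # Count samples whose ratio left the [1 - eps, 1 + eps] band.
        if abs(ratio - 1.0) > clip_eps:
            clipped += 1
        # Non-negative sample estimate of KL(old || new).
        kl_sum += (ratio - 1.0) - log_ratio
    return loss_sum / n, clipped / n, kl_sum / n
```

A clip fraction near zero means the clip rarely activates. A large clip fraction or a steadily growing approximate KL suggests the updates are too aggressive for the chosen clip ε or learning rate.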

How this connects to Module 6

Lesson	Where you use it
GAE	Advantage estimates for policy gradient
TRPO / PPO	Trust region via clipped ratio rₜ(θ)
A2C & parallel RL	Vectorized envs speed collection
PPO hyperparameters	clip ε, n_steps, learning rate
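
The GAE row can be illustrated with a short sketch of the backward recursion over one rollout. The function name `compute_gae` is hypothetical, and the default `lam=0.95` is an assumed value, not something this project specifies.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single rollout.

    rewards[t], values[t], dones[t] describe step t; dones[t] is True when
    the episode ended at step t. last_value is V(s_T), used to bootstrap
    when the rollout stops mid-episode. Hypothetical helper for study.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value-function targets are advantage plus baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `lam=0` the advantages reduce to one-step TD errors (low variance, more bias). With `lam=1` they become Monte Carlo returns minus the baseline (less bias, more variance).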

What you will build

Piece	Purpose
LunarLander training run	PPO with MlpPolicy
TensorBoard or CSV logs	Reward, ep_len, clip_fraction
`lander.gif` or screenshot	Best landing trajectory
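
If you choose CSV logs, a small summary script helps when writing up results. The sketch below assumes the file layout written by the Stable-Baselines3 Monitor wrapper when it is given a filename: one comment line starting with `#`, then a header `r,l,t` (episode reward, length, elapsed time). Note that `Monitor(env)` without a filename keeps these statistics in memory only. The function name `summarize_monitor_csv` is hypothetical.

```python
import csv

def summarize_monitor_csv(text, last_n=100):
    """Summarize episode reward and length from Monitor-style CSV text.

    Assumes a leading '#' metadata line followed by an 'r,l,t' header.
    """
    # Drop blank lines and the '#' metadata line before parsing.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    rows = list(csv.DictReader(lines))
    if not rows:
        raise ValueError("no episodes found in log")
    tail = rows[-last_n:]  # most recent episodes only
    rewards = [float(row["r"]) for row in tail]
    lengths = [float(row["l"]) for row in tail]
    return {
        "episodes": len(rows),
        "mean_reward": sum(rewards) / len(rewards),
        "mean_ep_len": sum(lengths) / len(lengths),
    }
```

To use it, read the log file into a string and pass it in, for example `summarize_monitor_csv(open(path).read())`, where `path` points at your own log file.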

Estimated time: 3–5 hours (SB3) or 8+ hours (from scratch).

Before you start

Finish the Module 6 quiz.
pip install "gymnasium[box2d]" stable-baselines3 tensorboard

LunarLander requires the Box2D dependency. If the Box2D build fails on your system, run pip install swig first, then retry.

Step 1 — Baseline with Stable-Baselines3

python

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
 
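# On Gymnasium 1.0 and later, use "LunarLander-v3"; the v2 ID is deprecated there.
env = gym.make("LunarLander-v2")
env = Monitor(env)  # records per-episode reward and length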
 
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    clip_range=0.2,
    verbose=1,
    tensorboard_log="./tb_logs/",
)
 
model.learn(total_timesteps=500_000)
model.save("ppo_lunar_lander")
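
It helps to know how `n_steps`, `batch_size`, and `n_epochs` interact before changing any of them. The sketch below works out the update budget for the settings above, assuming training proceeds in whole rollouts of `n_steps * n_envs` transitions and that each epoch sweeps the rollout once in minibatches. The function name `ppo_update_budget` is hypothetical.

```python
import math

def ppo_update_budget(total_timesteps, n_steps, batch_size, n_epochs, n_envs=1):
    """Estimate rollout and gradient-step counts for a PPO run.

    Assumes data is collected in whole rollouts, so the final rollout may
    overshoot total_timesteps slightly.
    """
    rollout_size = n_steps * n_envs
    rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches = math.ceil(rollout_size / batch_size)
    steps_per_rollout = minibatches * n_epochs
    return {
        "rollouts": rollouts,
        "timesteps_collected": rollouts * rollout_size,
        "minibatches_per_epoch": minibatches,
        "gradient_steps_per_rollout": steps_per_rollout,
        "gradient_steps_total": rollouts * steps_per_rollout,
    }

budget = ppo_update_budget(500_000, n_steps=2048, batch_size=64, n_epochs=10)
print(budget)
```

For the Step 1 settings this gives 245 rollouts and 320 gradient steps per rollout. Halving `n_steps` doubles the number of policy updates but makes each advantage estimate rest on less data.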

Step 2 — Evaluate

python

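# Reload the saved policy; `env` is the Monitor-wrapped environment from Step 1.
model = PPO.load("ppo_lunar_lander")
obs, _ = env.reset(seed=42)
total = 0.0
for _ in range(1000):  # LunarLander episodes are truncated at 1000 steps
    action, _ = model.predict(obs, deterministic=True)
    obs, r, term, trunc, _ = env.step(action)
    total += r
    if term or trunc:  # landed, crashed, or hit the time limit
        break
print("Eval return:", total)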

The loop above runs a single episode. Repeat it for 20 evaluation episodes with deterministic=True and report the mean and standard deviation of the returns.
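
One way to run the 20 episodes is a small helper that works with any environment following the Gymnasium `reset`/`step` API. This is a sketch, not the required approach: the name `evaluate_returns` and the per-episode seeding scheme (`base_seed + episode index`) are assumptions made for illustration. It reports the population standard deviation.

```python
import statistics

def evaluate_returns(env, predict, n_episodes=20, base_seed=42):
    """Run n_episodes and return (mean_return, std_return).

    predict: callable mapping an observation to an action.
    Each episode gets a distinct seed so results are reproducible.
    """
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=base_seed + ep)
        total, done = 0.0, False
        while not done:
            action = predict(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return statistics.mean(returns), statistics.pstdev(returns)
```

With the trained model, a call could look like `evaluate_returns(env, lambda obs: model.predict(obs, deterministic=True)[0])`. Stable-Baselines3 also ships `evaluate_policy` in `stable_baselines3.common.evaluation`, which returns the mean and standard deviation directly.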

Step 3 — Record best landing (optional)

python

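# pip install imageio
import os
import gymnasium as gym
import imageio

# render() only returns image frames when the env is created with render_mode="rgb_array".
render_env = gym.make("LunarLander-v2", render_mode="rgb_array")
frames = []
obs, _ = render_env.reset(seed=0)
for _ in range(1000):
    frames.append(render_env.render())
    action, _ = model.predict(obs, deterministic=True)
    obs, _, term, trunc, _ = render_env.step(action)
    if term or trunc:
        break
render_env.close()
os.makedirs("outputs", exist_ok=True)  # make sure the target directory exists
imageio.mimsave("outputs/lander.gif", frames, fps=30)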

Success criteria

Criterion	Target
Agent lands without crashing in most eval episodes	Required
Mean eval return ≥ 200 over 20 episodes	LunarLander "solved" territory
Training logs saved (TB or CSV)	Required
README notes one hyperparameter you changed	Required

Extension ideas

Run 4–8 parallel envs via SubprocVecEnv for faster sampling.
Implement the clipped objective manually on CartPole first, then transfer it to LunarLander.
Compare the sample efficiency of PPO and A2C.

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.