Project: Production RL serving
Before we begin
This capstone is not another training loop. You serve a trained policy behind an HTTP API with health checks, batch inference, and latency logging, which mirrors how the Module 9 lessons describe RL models reaching production.
How this connects to Module 9
| Lesson | Where you use it |
|---|---|
| Offline RL | Serve a frozen policy from logged training |
| Safety & deployment | Validate actions, fail closed on bad inputs |
| Monitoring | Log p50/p95 latency, request counts, errors |
| Evaluation | Shadow mode / A-B hooks (stub acceptable) |
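The "Safety & deployment" row can be made concrete with a small validation helper. The sketch below is a dependency-free illustration, not the course's reference code: the names validate_observation and validate_action, the 4-float observation size, and the two-action set are assumptions based on CartPole.

```python
import math

OBS_DIM = 4             # assumption: CartPole observations have 4 floats
VALID_ACTIONS = {0, 1}  # assumption: CartPole has two discrete actions


def validate_observation(obs):
    """Return the observation as floats, or raise ValueError (fail closed)."""
    if len(obs) != OBS_DIM:
        raise ValueError(f"expected {OBS_DIM} values, got {len(obs)}")
    values = [float(x) for x in obs]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("observation contains NaN or infinity")
    return values


def validate_action(action):
    """Reject any action outside the known discrete set."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    return int(action)
```

Failing closed here means the server refuses to answer rather than guessing: any input or output it cannot verify becomes an error the caller has to handle.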
What you will build
| Piece | Tech | Purpose |
|---|---|---|
| Trained policy | SB3 .zip or PyTorch .pt | CartPole or Pendulum from earlier modules |
| serve.py | FastAPI | POST /act, GET /health |
| client.py | requests | Load test + latency stats |
| logs/ | JSON lines | Timestamp, obs hash, action, latency_ms |
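The logs/ row lists four fields per request. A minimal sketch of building one such JSON line follows; the field names ts and obs_hash, the truncated SHA-256 digest, and the helper name make_log_line are illustrative assumptions rather than a fixed schema.

```python
import hashlib
import json
import time


def make_log_line(observation, action, latency_ms, now=None):
    """Serialize one request as a JSON line: timestamp, obs hash, action, latency."""
    # Hash the observation's JSON form so each log row stays compact.
    obs_bytes = json.dumps(list(observation)).encode("utf-8")
    record = {
        "ts": time.time() if now is None else now,  # Unix timestamp in seconds
        "obs_hash": hashlib.sha256(obs_bytes).hexdigest()[:16],
        "action": int(action),
        "latency_ms": float(latency_ms),
    }
    return json.dumps(record)


# Fixed timestamp so the example is reproducible.
line = make_log_line([0.01, -0.02, 0.03, 0.04], action=1, latency_ms=1.25, now=0.0)
```

Appending each returned string plus a newline to logs/requests.jsonl yields a file that can be read back one json.loads call per line.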
Folder layout:
```text
rl-serving/
  policies/cartpole_ppo.zip   # export from Module 5/6 or quick SB3 train
  serve.py
  client.py
  logs/requests.jsonl
  README.md                   # architecture diagram in words + SLO notes
```

Estimated time: 4–6 hours.
Before you start
- Finish the Module 9 quiz.
- Install dependencies: pip install fastapi uvicorn stable-baselines3 gymnasium pydantic requests
- Have any trained discrete-action policy (CartPole PPO/DQN is fine).
Quick policy if needed:
```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(50_000)
model.save("policies/cartpole_ppo")
```

Step 1 — FastAPI service
```python
# serve.py
import hashlib
import json
import time
from pathlib import Path

import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from stable_baselines3 import PPO

app = FastAPI(title="RL Policy Server")
model = PPO.load("policies/cartpole_ppo")  # loaded once, at import time
LOG = Path("logs/requests.jsonl")
LOG.parent.mkdir(parents=True, exist_ok=True)


class ObsRequest(BaseModel):
    # CartPole observations have exactly 4 floats (pydantic v2 length constraints)
    observation: list[float] = Field(..., min_length=4, max_length=4)


@app.get("/health")
def health():
    return {"status": "ok", "model": "cartpole_ppo"}


@app.post("/act")
def act(req: ObsRequest):
    t0 = time.perf_counter()
    obs = np.array(req.observation, dtype=np.float32)
    # Fail closed: reject NaN/inf instead of passing them to the policy
    if not np.all(np.isfinite(obs)):
        raise HTTPException(400, "Invalid observation")
    action, _ = model.predict(obs, deterministic=True)
    latency_ms = (time.perf_counter() - t0) * 1000
    # Log the fields listed in the build table: timestamp, obs hash, action, latency
    record = {
        "ts": time.time(),
        "obs_hash": hashlib.sha256(obs.tobytes()).hexdigest()[:16],
        "action": int(action),
        "latency_ms": latency_ms,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return {"action": int(action), "latency_ms": round(latency_ms, 3)}
```

Run: uvicorn serve:app --host 0.0.0.0 --port 8000
Step 2 — Batch endpoint (stretch)
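Before filling in the stub below, it may help to see the per-row logic in isolation. This is a sketch under assumptions, not the reference solution: predict_fn is a hypothetical stand-in for a single-observation call such as model.predict, and dummy_policy exists only so the example runs without a trained model.

```python
import math
import time


def act_batch_logic(observations, predict_fn, obs_dim=4):
    """Predict each row; return the actions plus total latency in milliseconds."""
    t0 = time.perf_counter()
    actions = []
    for i, obs in enumerate(observations):
        # Fail closed: one bad row rejects the whole batch.
        if len(obs) != obs_dim or not all(math.isfinite(x) for x in obs):
            raise ValueError(f"invalid observation at index {i}")
        actions.append(int(predict_fn(obs)))
    latency_ms = (time.perf_counter() - t0) * 1000
    return {"actions": actions, "latency_ms": latency_ms}


def dummy_policy(obs):
    # Hypothetical stand-in: push right when the pole angle (index 2) is positive.
    return 1 if obs[2] > 0 else 0


result = act_batch_logic([[0.0, 0.0, 0.05, 0.0], [0.0, 0.0, -0.05, 0.0]], dummy_policy)
```

Inside the FastAPI handler, the ValueError would typically be translated into an HTTP 400. Whether to reject the whole batch or return per-row errors is a design choice worth noting in the README.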
```python
class BatchRequest(BaseModel):
    observations: list[list[float]]


@app.post("/act/batch")
def act_batch(req: BatchRequest):
    # predict each row; return actions + total latency
    ...
```

Step 3 — Client load test
```python
# client.py
import time

import numpy as np
import requests

url = "http://127.0.0.1:8000/act"
latencies = []
for _ in range(200):
    obs = np.random.randn(4).tolist()  # use valid CartPole ranges in real test
    t0 = time.perf_counter()
    r = requests.post(url, json={"observation": obs}, timeout=5)
    r.raise_for_status()
    latencies.append((time.perf_counter() - t0) * 1000)

latencies.sort()
print("p50:", latencies[len(latencies) // 2], "ms")
print("p95:", latencies[int(len(latencies) * 0.95)], "ms")
```

Use realistic observations (e.g. from env.reset()) for meaningful tests.
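The index arithmetic in the client is a quick approximation of a percentile. A more explicit nearest-rank version, using only the standard library, might look like the sketch below; the helper name percentile and the sample latencies are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


sample_ms = [1.0, 1.2, 1.1, 9.5, 1.3, 1.0, 1.4, 1.2, 1.1, 1.3]  # made-up latencies
p50 = percentile(sample_ms, 50)
p95 = percentile(sample_ms, 95)
```

With only ten samples, p95 is simply the single slowest request, which is one reason the load test sends 200.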
Step 4 — Monitoring checklist
| Signal | How |
|---|---|
| Availability | GET /health returns 200 |
| Latency | p50/p95 from client or logs |
| Errors | 400 on non-finite obs, 422 on malformed requests; 500 logged |
| Version | Include model name in health JSON |
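The latency and error signals above can also be computed offline from the request log. The sketch below assumes each JSONL line carries a latency_ms field and, optionally, a status field. The status field and the summarize_log name are assumptions, since the Step 1 server only logs successful requests.

```python
import json
import statistics


def summarize_log(lines):
    """Compute request count, error count, and p50/p95 latency from JSONL lines."""
    latencies, errors = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: a missing status means the request succeeded.
        if record.get("status", 200) >= 400:
            errors += 1
        latencies.append(record["latency_ms"])
    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
    # It needs at least two data points on most Python versions.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "requests": len(latencies),
        "errors": errors,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
    }


# Synthetic log: 100 successful requests plus one hypothetical 400.
sample = [json.dumps({"latency_ms": float(i)}) for i in range(1, 101)]
sample.append(json.dumps({"latency_ms": 3.0, "status": 400}))
summary = summarize_log(sample)
```

Against the real log, an open file handle can be passed directly, for example with open("logs/requests.jsonl") as f: summarize_log(f).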
Success criteria
| Criterion | Target |
|---|---|
| /health and /act work locally | Required |
| 200 requests without crash | Required |
| p95 latency < 50 ms on CPU (CartPole MLP) | Typical |
| JSONL log with ≥ 50 entries | Required |
| README describes rollback if policy regresses | Recommended |
Extension ideas
- Dockerize with Dockerfile + HEALTHCHECK.
- Prometheus /metrics counter for requests and errors.
- Shadow mode: log what a new policy would do without serving it.
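Shadow mode from the list above can be prototyped without any serving framework. In this sketch both policies are plain callables and the log is an in-memory list; the names shadow_act, live_policy, and candidate_policy are hypothetical.

```python
def shadow_act(obs, live_policy, candidate_policy, shadow_log):
    """Serve the live policy's action; record the candidate's action without using it."""
    action = live_policy(obs)
    try:
        shadow_action = candidate_policy(obs)
    except Exception as exc:
        # A failing candidate must never affect live traffic.
        shadow_log.append({"obs": list(obs), "live": action, "shadow": None, "error": repr(exc)})
    else:
        shadow_log.append({
            "obs": list(obs),
            "live": action,
            "shadow": shadow_action,
            "agree": action == shadow_action,
        })
    return action


def live_policy(obs):
    return 1 if obs[2] > 0 else 0           # hypothetical current policy


def candidate_policy(obs):
    return 1 if obs[2] + obs[3] > 0 else 0  # hypothetical new policy


log = []
served = shadow_act([0.0, 0.0, 0.05, -0.2], live_policy, candidate_policy, log)
```

The agreement rate across the shadow log gives a first signal on whether the candidate is safe to promote, before any A/B traffic is sent to it.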
What's next
Return to the course curriculum and continue to the next module when your project runs end-to-end.