Project: Production RL serving

Before we begin

This capstone is not another training loop. You serve a trained policy behind an HTTP API with health checks, batch inference, and latency logging, which is how the Module 9 lessons describe RL models reaching production.

How this connects to Module 9

Lesson	Where you use it
Offline RL	Serve a frozen policy from logged training
Safety & deployment	Validate actions, fail closed on bad inputs
Monitoring	Log p50/p95 latency, request counts, errors
Evaluation	Shadow mode / A-B hooks (stub acceptable)

What you will build

Piece	Tech	Purpose
Trained policy	SB3 `.zip` or PyTorch `.pt`	CartPole or Pendulum from earlier modules
`serve.py`	FastAPI	`POST /act`, `GET /health`
`client.py`	requests	Load test + latency stats
`logs/`	JSON lines	Timestamp, obs hash, action, latency_ms

Folder layout:

text

rl-serving/
  policies/cartpole_ppo.zip   # export from Module 5/6 or quick SB3 train
  serve.py
  client.py
  logs/requests.jsonl
  README.md                   # architecture diagram in words + SLO notes

Estimated time: 4–6 hours.

Before you start

Finish the Module 9 quiz.
pip install fastapi uvicorn stable-baselines3 gymnasium pydantic requests
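Have any trained discrete-action policy (CartPole PPO/DQN is fine). The code below assumes CartPole's 4-value observation and integer action; a Pendulum policy has a 3-value observation and a continuous action, so the request schema and the int(action) cast would need adjusting.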

Quick policy if needed:

python

import gymnasium as gym
from stable_baselines3 import PPO
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(50_000)
model.save("policies/cartpole_ppo")

Step 1 — FastAPI service

python

# serve.py
import hashlib
import json
import time
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from stable_baselines3 import PPO

MODEL_NAME = "cartpole_ppo"
app = FastAPI(title="RL Policy Server")
model = PPO.load(f"policies/{MODEL_NAME}")  # loaded once at import, not per request
LOG = Path("logs/requests.jsonl")
LOG.parent.mkdir(parents=True, exist_ok=True)

class ObsRequest(BaseModel):
    # CartPole observations have exactly 4 values; other lengths get a 422 from FastAPI.
    # min_length/max_length on a list is Pydantic v2 syntax.
    observation: list[float] = Field(..., min_length=4, max_length=4)

@app.get("/health")
def health():
    return {"status": "ok", "model": MODEL_NAME}

@app.post("/act")
def act(req: ObsRequest):
    t0 = time.perf_counter()
    obs = np.array(req.observation, dtype=np.float32)
    # Fail closed: reject NaN/inf instead of passing them to the policy.
    if not np.all(np.isfinite(obs)):
        raise HTTPException(400, "Invalid observation")
    action, _ = model.predict(obs, deterministic=True)
    latency_ms = (time.perf_counter() - t0) * 1000
    # Log the fields listed under "What you will build": timestamp, obs hash, action, latency.
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "obs_hash": hashlib.sha256(obs.tobytes()).hexdigest()[:16],
        "action": int(action),
        "latency_ms": latency_ms,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return {"action": int(action), "latency_ms": round(latency_ms, 3)}

Run: uvicorn serve:app --host 0.0.0.0 --port 8000

Step 2 — Batch endpoint (stretch)

python

class BatchRequest(BaseModel):
    observations: list[list[float]]

@app.post("/act/batch")
def act_batch(req: BatchRequest):
    # predict each row; return actions + total latency
    ...
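
One way to fill in the batch stub is to validate every row before the policy runs, so that a single bad row rejects the whole batch instead of producing partial output. The sketch below is framework-free so it can be checked on its own. `validate_batch`, `act_on_batch`, `MAX_BATCH`, and `stand_in_policy` are hypothetical names, not part of the course code. In `serve.py`, `predict_rows` would presumably wrap `model.predict` on a 2-D float32 array, and a `ValueError` would be translated into `HTTPException(400, ...)`.

```python
import math
import time
from typing import Callable, Sequence

OBS_DIM = 4      # CartPole observation length, matching ObsRequest
MAX_BATCH = 256  # hypothetical cap; choose one that fits your latency budget

Row = Sequence[float]


def validate_batch(observations: Sequence[Row]) -> None:
    """Raise ValueError unless every row is a finite vector of OBS_DIM values."""
    if not 0 < len(observations) <= MAX_BATCH:
        raise ValueError(f"batch size must be between 1 and {MAX_BATCH}")
    for i, row in enumerate(observations):
        if len(row) != OBS_DIM:
            raise ValueError(f"row {i}: expected {OBS_DIM} values, got {len(row)}")
        if not all(math.isfinite(x) for x in row):
            raise ValueError(f"row {i}: non-finite value")


def act_on_batch(
    observations: Sequence[Row],
    predict_rows: Callable[[Sequence[Row]], Sequence[int]],
) -> dict:
    """Validate the batch, run the policy once, and time only the prediction."""
    validate_batch(observations)
    t0 = time.perf_counter()
    actions = [int(a) for a in predict_rows(observations)]
    latency_ms = (time.perf_counter() - t0) * 1000
    return {"actions": actions, "latency_ms": round(latency_ms, 3)}


def stand_in_policy(rows: Sequence[Row]) -> list:
    # Stand-in for a trained model: push toward the side the pole leans
    # (index 2 is the pole angle in CartPole observations).
    return [1 if row[2] > 0 else 0 for row in rows]


if __name__ == "__main__":
    batch = [[0.0, 0.0, 0.02, 0.0], [0.0, 0.0, -0.02, 0.0]]
    print(act_on_batch(batch, stand_in_policy))
```

Rejecting the whole batch keeps the response shape simple: the caller gets either one action per row or a single error.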

Step 3 — Client load test

python

# client.py
import time
import requests
import numpy as np
 
url = "http://127.0.0.1:8000/act"
latencies = []
for _ in range(200):
    obs = np.random.randn(4).tolist()  # use valid CartPole ranges in real test
    t0 = time.perf_counter()
    r = requests.post(url, json={"observation": obs}, timeout=5)
    r.raise_for_status()
    latencies.append((time.perf_counter() - t0) * 1000)
 
latencies.sort()
print("p50:", latencies[len(latencies)//2], "ms")
print("p95:", latencies[int(len(latencies)*0.95)], "ms")

Use realistic observations (e.g. from env.reset()) for meaningful tests.
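
As a dependency-free stand-in for `env.reset()`, the sketch below draws observations from the range CartPole-v1 uses for its initial state, where each of the four values is uniform within ±0.05. `sample_reset_obs` is a hypothetical helper. It only covers start-of-episode states, so replaying observations recorded from real rollouts would exercise the policy more widely.

```python
import random

RESET_BOUND = 0.05  # CartPole-v1 samples each initial state value in (-0.05, 0.05)
OBS_DIM = 4


def sample_reset_obs(rng: random.Random) -> list:
    """Return one observation shaped like a CartPole-v1 initial state."""
    return [rng.uniform(-RESET_BOUND, RESET_BOUND) for _ in range(OBS_DIM)]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a load test can be repeated
    for _ in range(3):
        # Same JSON shape that client.py posts to /act.
        print({"observation": sample_reset_obs(rng)})
```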

Step 4 — Monitoring checklist

Signal	How
Availability	`GET /health` returns 200
Latency	p50/p95 from client or logs
Errors	400 on non-finite obs; 422 on malformed payload (FastAPI validation); 500 logged
Version	Include model name in health JSON
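
The latency signal can also be computed from the server's own log rather than from the client. The sketch below assumes each line of `logs/requests.jsonl` is a JSON object with a `latency_ms` field, as written by `serve.py`. `percentile` and `latency_summary` are hypothetical helper names, and the nearest-rank method is one common percentile definition among several.

```python
import json
import math
from pathlib import Path


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(log_path):
    """Read a JSONL request log and return the count plus p50/p95 latency in ms."""
    latencies = []
    with Path(log_path).open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            latencies.append(float(json.loads(line)["latency_ms"]))
    if not latencies:
        raise ValueError(f"no log entries in {log_path}")
    latencies.sort()
    return {
        "count": len(latencies),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }
```

Calling `latency_summary("logs/requests.jsonl")` after a load test gives the server-side view. Expect it to be lower than the client's numbers, since the server timer covers only the handler from `t0` onward while the client measures the full HTTP round trip.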

Success criteria

Criterion	Target
`/health` and `/act` work locally	Required
200 requests without crash	Required
p95 latency < 50 ms on CPU (CartPole MLP)	Typical
JSONL log with ≥ 50 entries	Required
README describes rollback if policy regresses	Recommended

Extension ideas

Dockerize with Dockerfile + HEALTHCHECK.
Prometheus /metrics counter for requests and errors.
Shadow mode: log what a new policy would do without serving it.
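
Shadow mode can be sketched as a wrapper that always returns the serving policy's action and records the candidate's action on the side. All names here (`shadow_act`, `serving`, `candidate`) are hypothetical, and both policies are assumed to be plain callables from an observation to an integer action. Candidate failures are caught so they cannot change the response.

```python
import io
import json
from typing import Callable, Optional, Sequence, TextIO

Policy = Callable[[Sequence[float]], int]


def shadow_act(obs: Sequence[float], serving: Policy, candidate: Policy, log: TextIO) -> int:
    """Return the serving policy's action; log what the candidate would have done."""
    action = int(serving(obs))
    shadow_action: Optional[int] = None
    shadow_error: Optional[str] = None
    try:
        shadow_action = int(candidate(obs))
    except Exception as exc:  # a broken candidate must never affect the caller
        shadow_error = repr(exc)
    record = {
        "action": action,
        "shadow_action": shadow_action,
        "agree": shadow_action == action,
        "shadow_error": shadow_error,
    }
    log.write(json.dumps(record) + "\n")
    return action


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for an open JSONL log file
    served = shadow_act([0.0, 0.0, 0.02, 0.0], lambda o: 1, lambda o: 0, buffer)
    print(served, buffer.getvalue().strip())
```

In this form the candidate runs on the request path and adds to latency. Moving it to a background task or an offline replay of logged observations is a reasonable next step.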

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.