Monitoring & evaluation in production
Before we begin
A policy shipped to production drifts as the world changes: new users, seasonal demand, hardware wear. Monitoring tracks returns, costs, and data distributions; evaluation estimates whether a new policy beats the old one without exposing live traffic to unnecessary A/B risk. Operating RL systems therefore combines standard ML ops with bandit-style experiment design.
On-policy production eval: measure the deployed policy on live traffic (within an exploration budget).
Off-policy evaluation (OPE): estimate a new policy's value from logs collected under a different policy.
Distribution shift: the feature or reward distribution changes over time.
What you will learn
- Define SLIs for RL services: return, cost, latency, violation rate.
- Design A/B tests and interleaving for policy comparison.
- Run OPE with importance sampling and doubly robust estimators (intuition).
- Detect data drift on states and action distributions.
- Build dashboards and alerting for RL-specific failures.
What to log every step
| Field | Contents / use |
|---|---|
| state / context | sₜ or features |
| action | aₜ, propensity if stochastic |
| reward components | rₜ decomposed |
| cost / constraint | cₜ for CMDP |
| policy version | model hash, config |
| behavior policy prob | π_b(a \| s) |
| latency | inference ms |
| episode id | aggregate trajectories |
Missing propensities block reliable OPE later.
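```python
import time

import numpy as np

# Illustrative placeholders standing in for one live step.
obs = np.array([0.1, -0.4, 2.0])
a, r = 1, 0.5
log_pi, log_b = -0.69, -1.10

log = dict(
    state_feats=obs.tolist(),        # state / context features
    action=int(a),
    reward=float(r),
    policy_version="sac-v3.2",
    log_prob=float(log_pi),          # log-prob of the action under the logged policy version
    behavior_log_prob=float(log_b),  # if behavior != target
    ts=time.time(),
)
```
Service level indicators (RL)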
| SLI | Alert if |
|---|---|
| Mean episodic return | Drops > X% vs 7d baseline |
| Cost rate | Exceeds CMDP budget |
| Constraint violations | Any spike |
| p99 inference latency | Above SLA |
| Exploration rate | Accidental zero |
| Action saturation | Stuck on boundary |
Decompose reward — a drop in one component localizes bugs faster than scalar return alone.
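The alert conditions in the table can be sketched as a simple threshold check. This is a minimal illustration, not a real monitoring API: `SliWindow`, `check_slis`, and every threshold value (10% drop, cost budget of 1.0, 50 ms SLA, 50% boundary fraction) are assumptions chosen for the example, and spike detection for constraint violations is left out.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    """Aggregates for one monitoring window (hypothetical schema)."""
    mean_return: float
    cost_rate: float
    p99_latency_ms: float
    exploration_rate: float   # fraction of non-greedy actions
    boundary_fraction: float  # fraction of actions at an action-space bound


def check_slis(current, baseline_return, max_drop_pct=10.0, cost_budget=1.0,
               latency_sla_ms=50.0, max_boundary_fraction=0.5):
    """Return the names of SLIs that breach their (illustrative) thresholds."""
    alerts = []
    # Relative drop versus the baseline; assumes baseline_return is nonzero.
    drop_pct = 100.0 * (baseline_return - current.mean_return) / abs(baseline_return)
    if drop_pct > max_drop_pct:
        alerts.append("mean_return")
    if current.cost_rate > cost_budget:
        alerts.append("cost_rate")
    if current.p99_latency_ms > latency_sla_ms:
        alerts.append("p99_latency")
    if current.exploration_rate == 0.0:
        alerts.append("exploration_rate")
    if current.boundary_fraction > max_boundary_fraction:
        alerts.append("action_saturation")
    return alerts


window = SliWindow(mean_return=85.0, cost_rate=0.4, p99_latency_ms=30.0,
                   exploration_rate=0.0, boundary_fraction=0.1)
alerts = check_slis(window, baseline_return=100.0)
print(alerts)
```

In practice the baseline would come from the 7-day window mentioned in the table, and each threshold would be tuned per service.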
A/B testing policies
- Traffic split — 95% champion, 5% challenger (if safe).
- Contextual — assign by user id hash for consistency.
- Duration — long enough for episodic tasks (full user journeys).
- Guardrails — auto-promote challenger only if return ↑ and cost OK.
Interleaving (two policies alternate steps) increases sensitivity in some recommender settings, but it is harder to apply to long-horizon control, where a delayed return cannot be cleanly credited to either policy's steps.
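The traffic split, hash-based assignment, and guardrail bullets can be sketched together as follows. The function names, the `salt` parameter, and the gate's thresholds are hypothetical; the gate also compares point estimates only and would need a significance test on top in a real experiment.

```python
import hashlib


def assign_arm(user_id: str, challenger_share: float = 0.05,
               salt: str = "exp-001") -> str:
    """Deterministically map a user id to 'champion' or 'challenger'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"


def should_promote(champ_return, chall_return, chall_cost, cost_budget,
                   min_lift=0.0):
    """Guardrail gate: promote only if return improves AND cost stays in budget."""
    return (chall_return - champ_return) > min_lift and chall_cost <= cost_budget


arms = [assign_arm(f"user-{i}") for i in range(10_000)]
share = arms.count("challenger") / len(arms)
print(f"challenger share: {share:.3f}")
print(should_promote(100.0, 103.0, chall_cost=1.5, cost_budget=1.0))
```

Hashing the user id with a per-experiment salt keeps each user on the same arm for the whole experiment while decorrelating assignments across experiments.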
Worked power example
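Baseline return 100, std 20 (σ = 20), and the goal is to detect a +2% lift (Δ = 2). For a two-sided z-test with equal-sized arms, the usual approximation is n ≈ 2(z_{α/2} + z_β)² σ² / Δ² per arm; with α = 0.05 and 80% power that is about 1,570 episodes per arm. The required sample size grows with the variance and with the inverse square of the effect size, so plan for thousands of episodes in total, not dozens.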
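The arithmetic behind this example can be checked with a short script. It is a sketch assuming a two-sided z-test, equal-sized arms, and a known standard deviation; `episodes_per_arm` is an illustrative name, and the 5% significance level and the 80% and 90% power targets are assumptions rather than values from the lesson.

```python
import math
from statistics import NormalDist


def episodes_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-sample z-test (equal arms)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    n = 2.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)


n_80 = episodes_per_arm(sigma=20.0, delta=2.0)              # 80% power
n_90 = episodes_per_arm(sigma=20.0, delta=2.0, power=0.90)  # 90% power
print(n_80, n_90)
```

Halving the detectable lift quadruples the requirement, which is why small lifts on noisy returns need long experiments.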
Checkpoint: Why is early stopping on "challenger winning" dangerous?
Answer: Peeking inflates the false positive rate, because every extra look at the accumulating data gives a random win another chance to cross the threshold. Use fixed horizons, sequential testing with correction, or Bayesian methods with pre-registered rules.
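A small A/A simulation makes the effect of peeking concrete. Both arms below share the same true mean, so every declared winner is a false positive. The number of looks, the sample sizes, the seed, and the function name are all illustrative assumptions, and the test is a plain z-test with known σ.

```python
import numpy as np


def false_positive_rates(n_sims=2000, n_per_arm=1000, n_looks=10,
                         sigma=20.0, z_crit=1.96, seed=0):
    """Simulate A/A tests: stop at the first significant look vs. one final look."""
    rng = np.random.default_rng(seed)
    # Evenly spaced interim looks; the last look uses all episodes.
    look_sizes = np.arange(1, n_looks + 1) * (n_per_arm // n_looks)
    peek_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        a = rng.normal(100.0, sigma, n_per_arm)
        b = rng.normal(100.0, sigma, n_per_arm)
        z_at_look = []
        for n in look_sizes:
            se = sigma * np.sqrt(2.0 / n)
            z_at_look.append(abs(a[:n].mean() - b[:n].mean()) / se)
        peek_hits += int(any(z > z_crit for z in z_at_look))
        fixed_hits += int(z_at_look[-1] > z_crit)
    return peek_hits / n_sims, fixed_hits / n_sims


peek_rate, fixed_rate = false_positive_rates()
print(f"peeking: {peek_rate:.3f}  fixed horizon: {fixed_rate:.3f}")
```

The fixed-horizon rate should sit near the nominal 5%, while stopping at the first significant look lands well above it.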
Off-policy evaluation recap
Estimate J(π_challenger) from logs collected under π_behavior:
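- Importance weight: ρ = π_challenger(a|s) / π_behavior(a|s). Truncate extreme weights, accepting some bias in exchange for lower variance.
- Doubly robust: combine the direct method (a learned reward or Q model) with an IS correction on its residual, for lower variance.
- FQE (fitted Q evaluation): fit Q^π on the logged data and report the average Q(s, π(s)) over initial states.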
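OPE is unreliable when behavior support is thin: importance weights become extreme (high variance), and actions the behavior policy never took cannot be corrected for at all (bias). Always validate on a small live canary when possible.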
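For intuition, the importance sampling and doubly robust estimators can be written out for the one-step (contextual bandit) case. This is a sketch under assumptions: `ope_estimates` is an illustrative name, the reward model values are made up for the toy data, and a multi-step version would multiply per-step ratios along each trajectory. FQE is not shown.

```python
import numpy as np


def ope_estimates(rewards, target_probs, behavior_probs, q_logged, v_target,
                  clip=10.0):
    """One-step OPE: IS, clipped IS, self-normalized IS, direct method, DR.

    q_logged[i]: reward-model prediction for the logged action in state s_i.
    v_target[i]: sum over a of pi_target(a|s_i) * q_hat(s_i, a).
    """
    rho = target_probs / behavior_probs   # importance weights
    rho_clipped = np.minimum(rho, clip)   # truncation: less variance, some bias
    return {
        "is": float(np.mean(rho * rewards)),
        "clipped_is": float(np.mean(rho_clipped * rewards)),
        "snis": float(np.sum(rho * rewards) / np.sum(rho)),
        "dm": float(np.mean(v_target)),
        "dr": float(np.mean(v_target + rho * (rewards - q_logged))),
    }


# Toy logs: two actions, uniform behavior policy, action 1 always pays 1.
rewards = np.array([0.0, 1.0, 0.0, 1.0])         # logged actions were 0, 1, 0, 1
behavior_probs = np.array([0.5, 0.5, 0.5, 0.5])
target_probs = np.array([0.1, 0.9, 0.1, 0.9])    # target prefers action 1
q_logged = np.array([0.0, 0.8, 0.0, 0.8])        # imperfect model: q_hat(a=1) = 0.8
v_target = np.full(4, 0.1 * 0.0 + 0.9 * 0.8)     # direct-method value per state

est = ope_estimates(rewards, target_probs, behavior_probs, q_logged, v_target)
print(est)
```

In this toy case the true value of the target policy is 0.9. The direct method inherits the reward model's error, while the IS correction on the residual lets the doubly robust estimate recover it.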
Drift detection
| Signal | Method |
|---|---|
| State distribution | KL, PSI, MMD on features |
| Action distribution | Histogram vs train |
| Reward model gap | Predicted vs actual r |
| Q disagreement | Ensemble spread on live batch |
Trigger a retrain, or fall back to a behavior-cloned (BC) policy, when drift exceeds a threshold.
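As one concrete instance of the state-distribution row, the Population Stability Index (PSI) for a single feature can be sketched as below. The function name, the bin count, and the synthetic Gaussian data are assumptions for illustration. A commonly quoted rule of thumb treats PSI above roughly 0.25 as a major shift, but any threshold should be tuned per feature.

```python
import numpy as np


def psi(reference, live, n_bins=10, eps=1e-6):
    """Population Stability Index of one feature, binned on reference quantiles."""
    # Interior bin edges from the reference sample; the outer bins are open-ended.
    interior = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ref_idx = np.searchsorted(interior, reference)   # bin index in 0..n_bins-1
    live_idx = np.searchsorted(interior, live)
    ref_frac = np.bincount(ref_idx, minlength=n_bins) / len(reference)
    live_frac = np.bincount(live_idx, minlength=n_bins) / len(live)
    ref_frac = np.clip(ref_frac, eps, None)          # avoid log(0) on empty bins
    live_frac = np.clip(live_frac, eps, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))


rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)
live_same = rng.normal(0.0, 1.0, 5000)      # no drift
live_shifted = rng.normal(1.0, 1.0, 5000)   # mean shifted by one std

psi_same = psi(train_feature, live_same)
psi_shifted = psi(train_feature, live_shifted)
print(f"no drift: {psi_same:.4f}  shifted: {psi_shifted:.4f}")
```

The same function applies to discretized action distributions; for multivariate state drift, MMD or a per-feature PSI sweep is the usual next step.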
Dashboard layout (suggested)
- Overview — return, cost, traffic split.
- Components — reward terms stacked.
- Slices — per region, device, cohort.
- System — latency, errors, GPU.
- Experiments — active A/B status.
Annotate deploy events on time series to correlate regressions.
Incident response
- Rollback policy version.
- Freeze exploration.
- Pull last N hours logs for replay in sim.
- Root cause — data bug vs model vs env change.
- Postmortem — update constraints / tests.
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Scalar return only | Mystery regressions | Log components |
| No policy version in logs | Cannot correlate regressions with deploys or roll back | Version tag |
| A/B without guardrail costs | Unsafe challenger wins | Multi-metric gate |
| Trust OPE alone | Deploy failure | Canary + OPE |
| Ignore seasonality | False drift alerts | Compare year over year or adjust for seasonality |
| Eval greedy while train stochastic | Train/serve skew | Match modes |
Closing
Production RL is a closed loop: deploy, measure, compare, detect drift, retrain or rollback. You have now covered the full Deep RL track arc — from MDPs and tabular methods through deep actors, model-based planning, continuous control, and operating policies responsibly in the real world. The Module 9 project ties logging and serving into a minimal production-style pipeline.