Module 9 — Production & advanced topics

Monitoring & evaluation in production

Online vs offline metrics, A/B tests, drift, and rollback strategies.

~55 min read + exercises

Before we begin

A policy shipped to production drifts as the world changes — new users, seasonal demand, hardware wear. Monitoring tracks returns, costs, and data distribution; evaluation estimates whether a new policy beats the old one without reckless A/B risk. RL systems need ML ops plus bandit experiment design.

On-policy production eval: measure the deployed policy on live traffic (within an exploration budget).
Off-policy evaluation (OPE): estimate a new policy's value from logs collected under a different policy.
Distribution shift: the feature or reward distribution changes over time.


What you will learn

  • Define SLIs for RL services: return, cost, latency, violation rate.
  • Design A/B tests and interleaving for policy comparison.
  • Run OPE with importance sampling and doubly robust estimators (intuition).
  • Detect data drift on state and action distributions.
  • Build dashboards and alerting for RL-specific failures.

What to log every step

Field                   Use
state / context         sₜ or features
action                  aₜ, propensity if stochastic
reward components       rₜ decomposed
cost / constraint       cₜ for CMDP
policy version          model hash, config
behavior policy prob    π_b(a|s)
latency                 inference ms
episode id              aggregate trajectories

Missing propensities block reliable OPE later.

python
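import time

# obs, a, r, log_pi, and log_b come from the serving loop for the current step.
log = dict(
    state_feats=obs.tolist(),        # observation as a plain list (obs is array-like)
    action=int(a),
    reward=float(r),
    policy_version="sac-v3.2",
    log_prob=float(log_pi),          # log π(a|s) under the target policy
    behavior_log_prob=float(log_b),  # if behavior != target
    ts=time.time(),                  # wall-clock timestamp in seconds
)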

Service level indicators (RL)

SLI                       Alert if
Mean episodic return      Drops > X% vs 7d baseline
Cost rate                 Exceeds CMDP budget
Constraint violations     Any spike
p99 inference latency     Above SLA
Exploration rate          Accidental zero
Action saturation         Stuck on boundary
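
A few of the alert rules in the table can be sketched as plain threshold checks. This is a minimal illustration, not a monitoring framework: the `SliSnapshot` fields, the `sli_alerts` function, the alert names, and all thresholds are hypothetical, and the baseline return is assumed to be nonzero.

```python
from dataclasses import dataclass


@dataclass
class SliSnapshot:
    """Aggregated SLIs for one monitoring window (hypothetical schema)."""
    mean_return: float
    cost_rate: float
    p99_latency_ms: float
    exploration_rate: float


def sli_alerts(live, baseline_return, max_drop_pct, cost_budget, latency_sla_ms):
    """Return the names of the alert rules that fire for this window."""
    alerts = []
    # Relative drop of mean return versus the (nonzero) baseline, in percent.
    drop_pct = 100.0 * (baseline_return - live.mean_return) / abs(baseline_return)
    if drop_pct > max_drop_pct:
        alerts.append("return_drop")
    if live.cost_rate > cost_budget:
        alerts.append("cost_over_budget")
    if live.p99_latency_ms > latency_sla_ms:
        alerts.append("latency_over_sla")
    if live.exploration_rate == 0.0:
        alerts.append("exploration_zero")
    return alerts


window = SliSnapshot(mean_return=90.0, cost_rate=0.5,
                     p99_latency_ms=40.0, exploration_rate=0.0)
fired = sli_alerts(window, baseline_return=100.0, max_drop_pct=5.0,
                   cost_budget=1.0, latency_sla_ms=50.0)
print(fired)
```

In this example the return is 10% below baseline and exploration has collapsed to zero, so those two rules fire while cost and latency stay quiet.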

Decompose reward — a drop in one component localizes bugs faster than scalar return alone.
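
As a small sketch of how component logging localizes a regression, the snippet below compares per-component mean rewards against a baseline and reports the component that fell the most. The component names and numbers are made up for illustration.

```python
def largest_component_drop(baseline, live):
    """Return (component, delta) for the component whose mean fell the most."""
    deltas = {name: live[name] - baseline[name] for name in baseline}
    worst = min(deltas, key=deltas.get)
    return worst, deltas[worst]


# Hypothetical per-component mean rewards over two windows.
baseline_means = {"engagement": 5.0, "revenue": 3.0, "penalty": -1.0}
live_means = {"engagement": 4.9, "revenue": 1.5, "penalty": -1.0}

component, delta = largest_component_drop(baseline_means, live_means)
print(component, delta)
```

The scalar return only shows a drop of 1.6; the decomposition shows that almost all of it comes from one term.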


A/B testing policies

  • Traffic split — 95% champion, 5% challenger (if safe).
  • Contextual — assign by user id hash for consistency.
  • Duration — long enough for episodic tasks (full user journeys).
  • Guardrails — auto-promote challenger only if return ↑ and cost OK.
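
One way to get a consistent 95/5 split keyed on user id is to hash the id into a bucket. The sketch below assumes string user ids; the function name `assign_arm` and the `salt` value are hypothetical. Changing the salt per experiment reshuffles users so the same people are not always in the challenger arm.

```python
import hashlib


def assign_arm(user_id, challenger_pct=5, salt="exp-001"):
    """Deterministically map a user id to 'champion' or 'challenger'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "challenger" if bucket < challenger_pct else "champion"


# The same user always lands in the same arm; the overall split is close to 5%.
arms = [assign_arm(f"user-{i}") for i in range(10_000)]
challenger_share = arms.count("challenger") / len(arms)
print(f"challenger share: {challenger_share:.3f}")
```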

Interleaving (two policies alternate steps) increases sensitivity in some recommender settings — harder for long-horizon control.

Worked power example

Baseline return 100, std 20 (σ = 20); we want to detect a +2% lift (Δ = 2). Required sample size scales with σ²/Δ², so with σ/Δ = 10 plan on well over a thousand episodes per arm, not dozens.
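
To put a number on this, here is a rough calculation using the standard two-sample z-test approximation, n per arm = 2 (z_{α/2} + z_β)² σ² / Δ². It assumes independent episodes, equal variance in both arms, a two-sided 5% significance level, and 80% power; those settings are assumptions of this sketch, not values from the lesson.

```python
from math import ceil
from statistics import NormalDist


def episodes_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Approximate episodes per arm for a two-sided two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


n = episodes_per_arm(sigma=20.0, delta=2.0)
print(n)  # 1570
```

Halving the detectable lift to Δ = 1 quadruples the requirement, which is why small improvements on noisy returns need long experiments.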

Checkpoint: Why is early stopping on "challenger winning" dangerous?

Answer

Peeking inflates false positive rate — multiple looks at the same data favor random wins. Use fixed horizons, sequential testing with correction, or Bayesian methods with pre-registered rules.
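
The effect of peeking can be seen in a small A/A simulation, where both arms draw from the same distribution so every "win" is a false positive. This is an illustrative sketch with unit-variance Gaussian returns and a known-variance z-test; the number of looks and sample sizes are arbitrary choices.

```python
import random
from math import sqrt


def peeking_false_positive_rates(n_sims=1000, n_per_arm=200, n_looks=10,
                                 z_crit=1.96, seed=0):
    """Compare false positive rates: one final test vs. stopping at any look."""
    rng = random.Random(seed)
    look_every = n_per_arm // n_looks
    fixed_hits = 0
    peek_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        rejected_at_any_look = False
        z = 0.0
        for i in range(1, n_per_arm + 1):
            sum_a += rng.gauss(0.0, 1.0)  # both arms share the same true mean
            sum_b += rng.gauss(0.0, 1.0)
            if i % look_every == 0:
                # z-statistic for the difference in means with known unit variance.
                z = (sum_b - sum_a) / sqrt(2.0 * i)
                if abs(z) > z_crit:
                    rejected_at_any_look = True
        peek_hits += rejected_at_any_look
        fixed_hits += abs(z) > z_crit  # z now holds the final-look statistic
    return fixed_hits / n_sims, peek_hits / n_sims


fixed_rate, peek_rate = peeking_false_positive_rates()
print(f"fixed horizon: {fixed_rate:.3f}, stop at any look: {peek_rate:.3f}")
```

The fixed-horizon rate should sit near the nominal 5%, while declaring a winner at the first significant look pushes the rate well above it.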


Off-policy evaluation recap

Estimate J(π_challenger) from logs collected under π_behavior:

  • Importance weight ρ = π_challenger(a|s) / π_behavior(a|s) — truncate extreme weights.
  • Doubly robust — combine direct method + IS for lower variance.
  • FQE — fit Q^π on data, report average Q(s, π(s)).
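
The first two estimators can be sketched for the simplest case, where each logged step is treated as an independent one-step (contextual bandit) sample. For multi-step trajectories the per-step ratios would be multiplied along the episode, which this sketch does not do. The function names, the clip value of 10, and the toy data are assumptions for illustration; `q_logged` and `v_target` stand in for outputs of a separately fitted reward or Q model.

```python
from math import exp, log


def importance_weights(target_log_probs, behavior_log_probs, clip=10.0):
    """Truncated weights rho = pi_challenger(a|s) / pi_behavior(a|s)."""
    return [min(exp(lt - lb), clip)
            for lt, lb in zip(target_log_probs, behavior_log_probs)]


def ips_estimate(rewards, weights):
    """Importance-sampling estimate of the challenger's mean reward."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def doubly_robust_estimate(rewards, weights, q_logged, v_target):
    """Direct-method value plus an importance-weighted residual correction.

    q_logged[i]: model prediction for the logged action in state i.
    v_target[i]: model prediction averaged over the challenger's actions.
    """
    terms = [v + w * (r - q)
             for r, w, q, v in zip(rewards, weights, q_logged, v_target)]
    return sum(terms) / len(terms)


# Toy logs: four steps with known propensities.
rewards = [1.0, 0.0, 1.0, 0.0]
target_lp = [log(0.5)] * 4
behavior_lp = [log(0.25), log(0.5), log(0.5), log(0.25)]

rho = importance_weights(target_lp, behavior_lp)  # approximately [2, 1, 1, 2]
v_ips = ips_estimate(rewards, rho)                # (2*1 + 0 + 1*1 + 0) / 4
# With a model that predicts the logged rewards perfectly, the residual term
# vanishes and the DR estimate equals the mean of v_target.
v_dr = doubly_robust_estimate(rewards, rho, q_logged=rewards, v_target=[0.4] * 4)
print(v_ips, v_dr)
```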

OPE is biased if behavior support is thin — always validate on small live canary when possible.
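
A quick diagnostic for thin support is the effective sample size of the importance weights, ESS = (Σ w)² / Σ w². When a handful of weights dominate, ESS collapses toward 1 even if the log contains many rows. The sketch below uses made-up weights; what counts as "too low" is a judgment call for each system.

```python
def effective_sample_size(weights):
    """Kish effective sample size of a list of importance weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq if total_sq > 0 else 0.0


ess_balanced = effective_sample_size([1.0] * 100)          # all rows contribute equally
ess_skewed = effective_sample_size([100.0] + [0.01] * 99)  # one row dominates
print(ess_balanced, ess_skewed)
```

Here 100 equally weighted rows give an ESS of 100, while the skewed log behaves like roughly a single sample.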


Drift detection

Signal                  Method
State distribution      KL, PSI, MMD on features
Action distribution     Histogram vs train
Reward model gap        Predicted vs actual r
Q disagreement          Ensemble spread on live batch
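
As one concrete instance of the methods in the table, the population stability index (PSI) on a single feature can be sketched as below. Bins are taken from the reference (training) range, live values outside that range are clamped into the edge bins, and empty bins are floored at a small `eps`; these choices, along with the function name, are assumptions of this sketch. Cut-offs such as 0.1 or 0.25 are common rules of thumb rather than guarantees.

```python
from math import log


def psi(reference, live, n_bins=10, eps=1e-6):
    """Population stability index of `live` against `reference` for one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for x in values:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    p = bin_fractions(reference)
    q = bin_fractions(live)
    # Each term is non-negative, so PSI is zero only when the histograms match.
    return sum((qi - pi) * log(qi / pi) for pi, qi in zip(p, q))


train_feature = [i / 100 for i in range(100)]
psi_same = psi(train_feature, train_feature)
psi_shifted = psi(train_feature, [x + 0.3 for x in train_feature])
print(psi_same, psi_shifted)
```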

Trigger a retrain or fall back to a behavior-cloning (BC) policy when drift exceeds the threshold.


Dashboard layout (suggested)

  1. Overview — return, cost, traffic split.
  2. Components — reward terms stacked.
  3. Slices — per region, device, cohort.
  4. System — latency, errors, GPU.
  5. Experiments — active A/B status.

Annotate deploy events on time series to correlate regressions.


Incident response

  1. Rollback policy version.
  2. Freeze exploration.
  3. Pull last N hours logs for replay in sim.
  4. Root cause — data bug vs model vs env change.
  5. Postmortem — update constraints / tests.

Common mistakes

Mistake                              Symptom                          Fix
Scalar return only                   Mystery regressions              Log components
No policy version in logs            Cannot roll back or correlate    Version tag
A/B without guardrail costs          Unsafe challenger wins           Multi-metric gate
Trust OPE alone                      Deploy failure                   Canary + OPE
Ignore seasonality                   False drift alerts               Compare YoY / adjust
Eval greedy while train stochastic   Train/serve skew                 Match modes

Closing

Production RL is a closed loop: deploy, measure, compare, detect drift, retrain or rollback. You have now covered the full Deep RL track arc — from MDPs and tabular methods through deep actors, model-based planning, continuous control, and operating policies responsibly in the real world. The Module 9 project ties logging and serving into a minimal production-style pipeline.

