Monitoring & evaluation in production

Before we begin

A policy shipped to production degrades as the world around it changes: new users, seasonal demand, hardware wear. Monitoring tracks returns, costs, and data distributions; evaluation estimates whether a new policy beats the old one before it is exposed to a large share of live traffic. Operating RL systems therefore combines standard MLOps practice with bandit-style experiment design.

On-policy production eval — measure deployed policy on live traffic (with exploration budget).
Off-policy evaluation (OPE) — estimate new policy from logs.
Distribution shift — feature or reward distribution changes over time.

What you will learn

Define SLIs for RL services: return, cost, latency, violation rate.
Design A/B tests and interleaving for policy comparison.
Run OPE with importance sampling and doubly robust estimators (intuition).
Detect data drift on states and action distributions.
Build dashboards and alerting for RL-specific failures.

What to log every step

Field	Use
state / context	sₜ or features
action	aₜ, propensity if stochastic
reward components	rₜ decomposed
cost / constraint	cₜ for CMDP
policy version	model hash, config
behavior policy prob	π_b(a|s)
latency	inference ms
episode id	aggregate trajectories

Missing propensities block reliable OPE later.

python

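import time

import numpy as np

# Placeholder values standing in for one environment step (illustrative only).
obs, a, r = np.zeros(4), 1, 0.5
log_pi, log_b = -0.7, -0.9

log = dict(
    state_feats=obs.tolist(),
    action=int(a),
    reward=float(r),
    policy_version="sac-v3.2",
    log_prob=float(log_pi),          # log-probability of the action under the target policy
    behavior_log_prob=float(log_b),  # log-probability under the behavior policy, if behavior != target
    ts=time.time(),
)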

Service level indicators (RL)

SLI	Alert if
Mean episodic return	Drops > X% vs 7d baseline
Cost rate	Exceeds CMDP budget
Constraint violations	Any spike
p99 inference latency	Above SLA
Exploration rate	Accidental zero
Action saturation	Stuck on boundary

Decompose reward — a drop in one component localizes bugs faster than scalar return alone.
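A minimal sketch of the "drops > X% vs baseline" check, applied per reward component so the alert names the term that regressed. The function name `return_drop_alert`, the component names, and the 10% default threshold are hypothetical choices for illustration, not values from this lesson.

```python
from statistics import mean


def return_drop_alert(baseline, current, max_drop_pct=10.0):
    """Return {component: drop_pct} for components whose mean fell too far.

    baseline / current map a reward-component name to a list of
    per-episode values (e.g. the 7-day baseline vs the latest window).
    """
    alerts = {}
    for name, base_vals in baseline.items():
        base = mean(base_vals)
        cur = mean(current[name])
        if base == 0:
            continue  # relative drop is undefined for a zero baseline
        drop_pct = 100.0 * (base - cur) / abs(base)
        if drop_pct > max_drop_pct:
            alerts[name] = round(drop_pct, 1)
    return alerts


# Hypothetical components: only "revenue" regressed.
baseline = {"engagement": [2.0, 2.0], "revenue": [5.0, 5.0, 5.0]}
current = {"engagement": [2.0, 1.9], "revenue": [4.0, 4.0, 4.0]}
alerts = return_drop_alert(baseline, current)
print(alerts)  # {'revenue': 20.0}
```

Summing the components would show only a modest drop in scalar return; the per-component view points directly at the term that moved.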

A/B testing policies

Traffic split — 95% champion, 5% challenger (if safe).
Contextual — assign by user id hash for consistency.
Duration — long enough for episodic tasks (full user journeys).
Guardrails — auto-promote challenger only if return ↑ and cost OK.
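The split, assignment, and guardrail rules above can be sketched as follows. This assumes string user ids and a hash-based bucket; `assign_arm`, `may_promote`, and the experiment name are hypothetical, and a real gate would also require the return lift to be statistically significant rather than merely positive.

```python
import hashlib


def assign_arm(user_id: str, experiment: str, challenger_share: float = 0.05) -> str:
    """Deterministically map a user to an arm so they always see one policy."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # First 8 bytes of the hash as a number that is uniform in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "challenger" if bucket < challenger_share else "champion"


def may_promote(return_lift: float, cost_rate: float, cost_budget: float) -> bool:
    """Multi-metric gate: promote only if return improved and cost is within budget."""
    return return_lift > 0.0 and cost_rate <= cost_budget


arms = [assign_arm(f"user-{i}", "exp-demo") for i in range(10_000)]
challenger_fraction = arms.count("challenger") / len(arms)
print(f"challenger share: {challenger_fraction:.3f}")
```

Salting the hash with the experiment name keeps assignments independent across experiments, so the same users are not always the ones who receive challengers.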

Interleaving (two policies alternate steps) increases sensitivity in some recommender settings, but it is harder to apply to long-horizon control, where an episode's return cannot be cleanly attributed to one policy.

Worked power example

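Suppose the baseline return is 100 with a standard deviation of 20, and you want to detect a +2% lift (Δ = 2). The required sample size scales with σ²/Δ², which is 100 here. With the common rule of thumb n ≈ 16σ²/Δ² (80% power, 5% two-sided significance), that is roughly 1,600 episodes per arm, so plan for thousands of episodes in total, not dozens.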
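The same calculation can be sketched with the standard two-sample normal approximation, n = 2(z_{1-α/2} + z_{power})² σ²/Δ² per arm. The significance level (5%, two-sided), the power (80%), and the assumptions of equal variance and independent episodes are choices made here, not stated in the lesson.

```python
from math import ceil
from statistics import NormalDist


def episodes_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Episodes per arm needed to detect a mean difference `delta` (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2)


n = episodes_per_arm(sigma=20.0, delta=2.0)
print(n)  # 1570
```

Because n grows with 1/Δ², halving the detectable lift to +1% roughly quadruples the requirement, which is why small lifts on noisy returns need long experiments.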

Checkpoint: Why is early stopping on "challenger winning" dangerous?

Answer

Peeking inflates the false positive rate: every extra look at the accumulating data gives random fluctuation another chance to cross the threshold. Use fixed horizons, sequential testing with correction, or Bayesian methods with pre-registered rules.

Off-policy evaluation recap

Estimate J(π_challenger) from logs collected under π_behavior:

Importance weight ρ = π_challenger(a|s) / π_behavior(a|s) — truncate extreme weights.
Doubly robust — combine direct method + IS for lower variance.
FQE — fit Q^π on data, report average Q(s, π(s)).

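OPE is unreliable when behavior support is thin: where the behavior policy rarely or never takes the actions the challenger prefers, importance weights explode (high variance), and truncating them or relying on a fitted model introduces bias. Always validate on a small live canary when possible.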
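To make the estimators above concrete, here is a minimal sketch for the one-step (contextual bandit) case, where each log row is a single decision. The function names, the truncation cap of 10, and the toy logs are assumptions for illustration; for multi-step trajectories the importance weights multiply across time steps and variance grows accordingly.

```python
def truncated_is(rewards, target_probs, behavior_probs, max_weight=10.0):
    """Importance-sampling estimate of the target policy's mean reward.

    target_probs / behavior_probs are the probabilities each policy
    assigns to the action that was actually logged.
    """
    total = 0.0
    for r, p_t, p_b in zip(rewards, target_probs, behavior_probs):
        rho = min(p_t / p_b, max_weight)  # truncate extreme weights
        total += rho * r
    return total / len(rewards)


def doubly_robust(rewards, target_probs, behavior_probs, q_logged, v_target, max_weight=10.0):
    """Direct-method value plus an importance-weighted correction of the model's error.

    q_logged[i] is the model's reward estimate for the logged action;
    v_target[i] is its expected value under the target policy in that state.
    """
    total = 0.0
    for r, p_t, p_b, q, v in zip(rewards, target_probs, behavior_probs, q_logged, v_target):
        rho = min(p_t / p_b, max_weight)
        total += v + rho * (r - q)
    return total / len(rewards)


# Toy logs: behavior is uniform over two actions; the target always picks
# action 1, which pays reward 1 (action 0 pays 0). True target value is 1.0.
rewards = [0.0, 1.0, 1.0, 0.0]
target_probs = [0.0, 1.0, 1.0, 0.0]
behavior_probs = [0.5, 0.5, 0.5, 0.5]

is_est = truncated_is(rewards, target_probs, behavior_probs)
# A deliberately poor reward model that predicts 0.5 everywhere.
dr_est = doubly_robust(rewards, target_probs, behavior_probs,
                       q_logged=[0.5] * 4, v_target=[0.5] * 4)
print(is_est, dr_est)  # 1.0 1.0
```

In this toy case the doubly robust estimate recovers the true value even though the reward model is wrong, because the importance-weighted residual corrects it. With accurate propensities and a good model, the same correction term is small, which is where the variance reduction comes from.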

Drift detection

Signal	Method
State distribution	KL, PSI, MMD on features
Action distribution	Histogram vs train
Reward model gap	Predicted vs actual r
Q disagreement	Ensemble spread on live batch

Trigger a retrain or fall back to BC when drift exceeds a threshold.
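As one example of the methods in the table, the population stability index (PSI) for a single feature can be sketched as below. The equal-width binning, the smoothing constant `eps`, and the 0.25 alert level (a common industry rule of thumb, not a value from this lesson) are all assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip out-of-range values
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train_feature = [i / 100 for i in range(100)]       # roughly uniform on [0, 1)
live_feature = [0.5 + i / 200 for i in range(100)]  # mass shifted to the upper half

score = psi(train_feature, live_feature)
drifted = score > 0.25  # hypothetical alert threshold
print(f"PSI={score:.2f} drifted={drifted}")
```

Running the check per feature and per slice (region, device, cohort) keeps a large shift in a small segment from being averaged away.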

Dashboard layout (suggested)

Overview — return, cost, traffic split.
Components — reward terms stacked.
Slices — per region, device, cohort.
System — latency, errors, GPU.
Experiments — active A/B status.

Annotate deploy events on time series to correlate regressions.

Incident response

Rollback policy version.
Freeze exploration.
Pull last N hours logs for replay in sim.
Root cause — data bug vs model vs env change.
Postmortem — update constraints / tests.

Common mistakes

Mistake	Symptom	Fix
Scalar return only	Mystery regressions	Log components
No policy version in logs	Cannot correlate regressions with deploys or target a rollback	Version tag
A/B without guardrail costs	Unsafe challenger wins	Multi-metric gate
Trust OPE alone	Deploy failure	Canary + OPE
Ignore seasonality	False drift alerts	Compare YoY / adjust
Eval greedy while train stochastic	Train/serve skew	Match modes

Closing

Production RL is a closed loop: deploy, measure, compare, detect drift, retrain or rollback. You have now covered the full Deep RL track arc — from MDPs and tabular methods through deep actors, model-based planning, continuous control, and operating policies responsibly in the real world. The Module 9 project ties logging and serving into a minimal production-style pipeline.
