
Module 9 — Production & advanced topics

Safety, alignment & deployment

Constrained RL, human oversight, reward hacking, and guardrails.

~55 min read + exercises

Before we begin

Deploying RL is not the same as solving a benchmark. A policy can maximize reward while violating constraints or intent: unsafe torques, biased recommendations, exploited reward loopholes. Safe RL and alignment aim to optimize task performance subject to safety, fairness, and human intent. Because a learned policy cannot guarantee any of these on its own, production systems need guardrails outside the policy.

Constrained MDP — maximize return subject to cost ≤ budget.
Reward hacking — policy exploits misspecified reward.
Shield / filter — override unsafe actions at runtime.


What you will learn

  • Formulate CMDPs with cost functions and Lagrangian methods.
  • List reward hacking examples and specification pitfalls.
  • Apply action shields, human-in-the-loop, and kill switches.
  • Connect alignment to offline conservative RL and monitoring.
  • Draft a deployment readiness checklist for RL services.

Constrained MDPs

A standard MDP maximizes the expected discounted return E[Σ γᵗ rₜ]. A CMDP adds a per-step cost cₜ and a budget C:

maximize E[Σ γᵗ rₜ] subject to E[Σ γᵗ cₜ] ≤ C

| Approach | Mechanism |
| --- | --- |
| Lagrangian | Optimize return − λ(cost − C); learn λ |
| Primal-dual | Alternate policy and λ updates |
| CPO / TRPO variants | Trust region with cost constraints |
| Reward shaping penalty | Add −λc to reward (approximate) |
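Written out, the Lagrangian row is a saddle-point problem. The shorthand J_r(π) and J_c(π) for expected discounted return and cost is notation assumed here, not taken from the lesson:

```latex
\[
J_r(\pi) = \mathbb{E}\Big[\sum_t \gamma^t r_t\Big], \qquad
J_c(\pi) = \mathbb{E}\Big[\sum_t \gamma^t c_t\Big]
\]
\[
\min_{\lambda \ge 0}\; \max_{\pi}\; L(\pi, \lambda)
  = J_r(\pi) - \lambda \big( J_c(\pi) - C \big)
\]
```

For a fixed λ the term λC is a constant, so maximizing L over π is the same as maximizing return under the shaped reward r − λc. That is why the reward shaping row is only approximate: with a hand-picked λ nothing pushes the cost toward the budget.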
python
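# Lagrangian reward sketch: penalize each step's cost, scaled by the multiplier
lagrangian_reward = r - lambda_cost * c
# Dual ascent on lambda: raise it when episode cost exceeds the budget,
# lower it otherwise, and clip at zero so the multiplier stays non-negative
lambda_cost = max(0.0, lambda_cost + lr_lambda * (episode_cost - C_budget))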
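To see the alternating updates converge, here is a minimal runnable sketch on a toy problem. Everything in it is an assumption made for illustration: the "policy" is a single scalar action a, the reward is r(a) = −(a − 2)², the cost is c(a) = a, and the function name `primal_dual` is hypothetical. It is not a full RL algorithm, only the primal-dual mechanism in isolation.

```python
def primal_dual(budget=1.0, lr_policy=0.05, lr_lambda=0.05, steps=2000):
    """Toy primal-dual loop on a one-parameter 'policy'.

    Assumed toy problem (not from the lesson):
        reward r(a) = -(a - 2)^2   (best unconstrained action is a = 2)
        cost   c(a) = a            (must satisfy c(a) <= budget)
    """
    a, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on L(a, lam) = r(a) - lam * (c(a) - budget)
        grad_a = -2.0 * (a - 2.0) - lam
        a += lr_policy * grad_a
        # Dual step: raise lam while cost exceeds the budget, clip at zero
        lam = max(0.0, lam + lr_lambda * (a - budget))
    return a, lam


if __name__ == "__main__":
    action, multiplier = primal_dual(budget=1.0)
    print(f"tight budget: action={action:.3f}, lambda={multiplier:.3f}")
    action, multiplier = primal_dual(budget=3.0)
    print(f"loose budget: action={action:.3f}, lambda={multiplier:.3f}")
```

With a tight budget the action settles on the constraint boundary and the multiplier settles at the value that makes the penalized gradient vanish. With a loose budget the constraint never binds and the multiplier stays at zero. The toy reward is strictly concave, which is why the last iterate converges here; with function approximation and noisy cost estimates the multiplier typically oscillates and needs smaller step sizes or averaging.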

Reward hacking and misspecification

| Environment | Hack |
| --- | --- |
| Boat race (CoastRunners) | Circle to re-hit respawning targets for points, never finish |
| Grasping | Vibrate object for contact reward |
| Recommendation | Clickbait maximizing clicks, not satisfaction |
| Chatbot RLHF proxy | Verbose flattering text, not truth |

Lesson: reward is a proxy. Monitor downstream metrics humans care about.

Checkpoint: Why doesn't a perfect simulator eliminate reward hacking?

Answer

Hacking exploits misspecified reward, not only sim error. Even with perfect physics, if reward counts "grasp contact" without lift, the policy hacks contact. Alignment fixes objective design and constraints, not just dynamics.
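The grasping case can be made concrete with a small sketch that scores two hand-written trajectories under a contact-counting proxy and under the intended objective. The `Step` fields, the lift threshold, and both trajectories are hypothetical values chosen for illustration; no simulator is involved, which is the point.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    contact: bool     # gripper is touching the object on this step
    height_m: float   # object height above the table, in metres


LIFT_THRESHOLD_M = 0.10  # hypothetical height that counts as a successful lift


def proxy_return(traj: List[Step]) -> float:
    """Misspecified reward: +1 for every step with gripper contact."""
    return float(sum(step.contact for step in traj))


def intended_return(traj: List[Step]) -> float:
    """What the designer wanted: object held above the threshold at the end."""
    last = traj[-1]
    return 1.0 if last.contact and last.height_m >= LIFT_THRESHOLD_M else 0.0


# Hack: stay in contact on every step without ever lifting.
hack = [Step(contact=True, height_m=0.0) for _ in range(10)]
# Intended behaviour: 4 approach steps without contact, then grasp and lift.
lift = [Step(False, 0.0)] * 4 + [Step(True, 0.04 * k) for k in range(1, 7)]

for name, traj in [("hack", hack), ("lift", lift)]:
    print(name, "proxy =", proxy_return(traj), "intended =", intended_return(traj))
```

The proxy ranks the hack above the real lift while the intended objective ranks them the other way, and both trajectories are physically valid. Better dynamics cannot fix that ordering; only a different reward or an added constraint can.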


Runtime safety layers

  1. Hard constraints: joint limits and max speed, clipped just before the actuators.
  2. Shield: check each action against a safety automaton and project unsafe actions onto the safe set.
  3. Uncertainty gate: if the ensemble's Q estimates disagree beyond a threshold, fall back to a safe policy.
  4. Human override: teleop takeover always available.
  5. Canary deploy: start at 1% of traffic and widen the rollout only while cost metrics stay within budget; otherwise roll back.

Never rely on the neural net alone for irreversible actions.
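The first three layers can be chained into one wrapper around the policy output. The sketch below assumes a 1-D robot driving toward a wall with a scalar velocity command; the constants, the wall-distance rule standing in for a safety automaton, and the function names are all hypothetical. Human override and canary rollout live outside this code path and are not shown.

```python
from statistics import pstdev
from typing import Sequence

MAX_SPEED = 1.0           # hypothetical actuator limit, m/s
DT = 0.1                  # hypothetical control period, s
MARGIN = 0.2              # hypothetical clearance to keep from the wall, m
DISAGREEMENT_LIMIT = 0.5  # hypothetical cap on ensemble Q standard deviation
SAFE_ACTION = 0.0         # hypothetical fallback policy: stop


def uncertainty_gate(action: float, q_ensemble: Sequence[float]) -> float:
    """Fall back to the safe action when ensemble Q estimates disagree."""
    return SAFE_ACTION if pstdev(q_ensemble) > DISAGREEMENT_LIMIT else action


def shield(action: float, distance_to_wall: float) -> float:
    """Project onto the safe set: never close the gap below MARGIN in one step."""
    max_forward = max(0.0, (distance_to_wall - MARGIN) / DT)
    return min(action, max_forward)


def clip_to_limits(action: float) -> float:
    """Hard constraint applied last, immediately before the actuator."""
    return max(-MAX_SPEED, min(MAX_SPEED, action))


def safe_action(policy_action: float,
                q_ensemble: Sequence[float],
                distance_to_wall: float) -> float:
    action = uncertainty_gate(policy_action, q_ensemble)
    action = shield(action, distance_to_wall)
    return clip_to_limits(action)


if __name__ == "__main__":
    print(safe_action(5.0, [1.0, 1.1, 0.9], distance_to_wall=10.0))   # clipped
    print(safe_action(0.8, [1.0, 1.1, 0.9], distance_to_wall=0.25))   # shielded
    print(safe_action(0.8, [0.0, 2.0, 4.0], distance_to_wall=10.0))   # gated
```

The clip runs last so that no earlier layer can reintroduce an out-of-range command. Each layer is a plain function with no learned parameters, which keeps it testable independently of the policy.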


Human feedback and RLHF (orientation)

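For language and preference tasks, RLHF learns a reward model from human rankings and then optimizes the policy against it, typically with PPO. The main risks are reward model overoptimization (the policy exploits errors in the learned reward) and distribution shift (the policy drifts toward outputs the reward model never saw during training). The setup resembles offline RL in that the reward model is fit to a fixed log of preferences.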

| Stage | Role |
| --- | --- |
| SFT | Behavior cloning on demonstrations |
| Reward model | Predict human preference |
| RL fine-tune | PPO with KL to SFT |
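Two small formulas sit behind the last two rows. The sketch below assumes a Bradley-Terry style pairwise loss for the reward model and a per-sample log-ratio penalty for the KL term; both are common choices rather than something this lesson specifies, and `beta` and the function names are hypothetical.

```python
import math


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(chosen - rejected)), computed stably.

    Uses the identity -log(sigmoid(m)) = max(-m, 0) + log1p(exp(-|m|)).
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(rm_score: float,
                     logp_policy: float,
                     logp_reference: float,
                     beta: float = 0.1) -> float:
    """Reward-model score minus beta times the sampled log-ratio to the reference."""
    return rm_score - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # The loss is small when the reward model already ranks the pair correctly.
    print(preference_loss(2.0, 0.0), preference_loss(0.0, 2.0))
    # No penalty when the policy matches the reference on this sample...
    print(kl_shaped_reward(1.0, logp_policy=-2.0, logp_reference=-2.0))
    # ...and a penalty when it assigns the sample far more probability.
    print(kl_shaped_reward(1.0, logp_policy=-1.0, logp_reference=-3.0))
```

The penalty term is what keeps PPO from chasing reward model errors: the further the policy moves from the SFT reference on a sample, the more reward it gives up.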

Fairness and robustness

  • Subpopulation performance — eval per segment, not only average return.
  • Adversarial observations — sensor attacks on policies.
  • Explainability — log state features driving actions for audits.
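Per-segment evaluation takes only a few lines once each logged episode carries a segment label. The sketch below assumes episodes arrive as (segment, return) pairs; the segment names and numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Episode = Tuple[str, float]  # (segment label, episode return)


def returns_by_segment(episodes: List[Episode]) -> Dict[str, float]:
    """Mean episode return for each segment."""
    grouped = defaultdict(list)
    for segment, episode_return in episodes:
        grouped[segment].append(episode_return)
    return {segment: mean(values) for segment, values in grouped.items()}


def worst_segment_gap(episodes: List[Episode]) -> Tuple[str, float]:
    """Worst-performing segment and how far it sits below the overall mean."""
    per_segment = returns_by_segment(episodes)
    overall = mean(episode_return for _, episode_return in episodes)
    worst = min(per_segment, key=per_segment.get)
    return worst, overall - per_segment[worst]


# Illustrative data: the minority segment is much worse than the average suggests.
episodes = [
    ("new_users", 2.0), ("new_users", 4.0),
    ("returning", 9.0), ("returning", 11.0),
    ("returning", 10.0), ("returning", 10.0),
]

if __name__ == "__main__":
    print(returns_by_segment(episodes))
    print(worst_segment_gap(episodes))
```

Reporting the worst-segment gap alongside the average keeps a large, well-served segment from hiding a small, poorly served one.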

Deployment checklist

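| Item | Pass criteria |
| --- | --- |
| Sim + OPE / offline eval | Beats the BC baseline; cost under budget |
| Stress tests | Holds up under domain randomization (DR) and adversarial noise |
| Latency budget | Inference time (ms) within SLA |
| Rollback | Previous policy one click away |
| Incident runbook | Names who can kill the policy |
| Compliance | Data retention and consent requirements met |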
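The checklist can also be enforced as an automated release gate. This is a minimal sketch in which every field name is hypothetical and each check is a simplified stand-in for the corresponding row; a real gate would read these values from evaluation and infrastructure systems.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadinessReport:
    # All fields are hypothetical placeholders for real evaluation outputs.
    policy_return: float
    bc_baseline_return: float
    estimated_cost: float
    cost_budget: float
    stress_tests_passed: bool
    latency_ms: float
    latency_sla_ms: float
    rollback_tested: bool
    kill_switch_owner: str
    compliance_approved: bool


def failed_checks(report: ReadinessReport) -> List[str]:
    """Return the checklist items that do not pass, in checklist order."""
    checks = {
        "Sim + OPE / offline eval": (
            report.policy_return > report.bc_baseline_return
            and report.estimated_cost <= report.cost_budget
        ),
        "Stress tests": report.stress_tests_passed,
        "Latency budget": report.latency_ms <= report.latency_sla_ms,
        "Rollback": report.rollback_tested,
        "Incident runbook": bool(report.kill_switch_owner.strip()),
        "Compliance": report.compliance_approved,
    }
    return [name for name, passed in checks.items() if not passed]


candidate = ReadinessReport(
    policy_return=12.4, bc_baseline_return=10.1,
    estimated_cost=0.8, cost_budget=1.0,
    stress_tests_passed=True,
    latency_ms=18.0, latency_sla_ms=25.0,
    rollback_tested=False, kill_switch_owner="",
    compliance_approved=True,
)

if __name__ == "__main__":
    print(failed_checks(candidate))
```

Returning the list of failed items rather than a single boolean makes the gate's output usable directly in a release ticket.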

Common mistakes

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Reward = only proxy metric | User harm | Multi-objective + human eval |
| No cost constraint | Safety incidents | CMDP / shields |
| RLHF without KL regularization | Gibberish policy | KL penalty to reference policy |
| Skip shadow mode | Surprise prod failures | Parallel logging |
| No post-deploy monitoring | Slow drift detection | Lesson 5 metrics |

Closing

Safe deployment treats the policy as one component in a constrained, monitored system. Specify rewards skeptically, enforce hard limits outside learning, and keep humans in the loop for high-stakes domains. Final lesson: metrics and evaluation that run continuously in production.

