Safety, alignment & deployment
Before we begin
Deploying RL is not the same as solving a benchmark. Policies can maximize reward while violating constraints — unsafe torques, biased recommendations, reward hacking. Safe RL and alignment aim to optimize task performance subject to safety, fairness, and human intent. Production needs guardrails outside the learned policy.
Constrained MDP — maximize return subject to cost ≤ budget.
Reward hacking — policy exploits misspecified reward.
Shield / filter — override unsafe actions at runtime.
What you will learn
- Formulate CMDPs with cost functions and Lagrangian methods.
- List reward hacking examples and specification pitfalls.
- Apply action shields, human-in-the-loop, and kill switches.
- Connect alignment to offline conservative RL and monitoring.
- Draft a deployment readiness checklist for RL services.
Constrained MDPs
Standard MDP maximizes E[Σ γᵗ rₜ]. CMDP adds costs cₜ:
maximize return subject to E[Σ γᵗ cₜ] ≤ C
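Written out in full, the constrained problem and its Lagrangian relaxation look roughly as follows. The names J_r and J_c and the explicit max-min form are notation introduced here for illustration; the lesson itself only names the return, the cost, and the budget C.

```latex
J_r(\pi) = \mathbb{E}_\pi\Big[\sum_{t} \gamma^t r_t\Big], \qquad
J_c(\pi) = \mathbb{E}_\pi\Big[\sum_{t} \gamma^t c_t\Big]

\max_{\pi} \; J_r(\pi) \quad \text{subject to} \quad J_c(\pi) \le C

\max_{\pi} \; \min_{\lambda \ge 0} \; J_r(\pi) - \lambda \big( J_c(\pi) - C \big)
```

If the policy exceeds the budget, the inner minimization can grow λ without bound, so any policy that solves the max-min problem must satisfy the constraint.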
| Approach | Mechanism |
|---|---|
| Lagrangian | Optimize return − λ(cost − C); learn λ |
| Primal-dual | Alternate policy and λ updates |
| CPO / TRPO variants | Trust region with cost constraints |
| Reward shaping penalty | Add −λc to reward (approximate) |
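To make the Lagrangian and primal-dual rows concrete, here is a minimal sketch on a hypothetical two-action problem. The rewards, costs, temperature, and step size are all made-up values, and the "policy update" is replaced by a closed-form softmax over the Lagrangian reward rather than a learned policy, so this illustrates the multiplier update only, not a full CMDP solver.

```python
import math

# Hypothetical two-action problem (values chosen for illustration only).
REWARDS = [1.0, 0.5]   # action 0 is high-reward but risky
COSTS = [1.0, 0.0]     # action 0 incurs cost, action 1 is safe
C_BUDGET = 0.3         # allowed expected cost
TEMPERATURE = 0.1      # softmax temperature of the stand-in policy
LR_LAMBDA = 0.1        # dual step size


def softmax_policy(lam):
    """Action probabilities that favour high Lagrangian reward r - lam * c."""
    scores = [(r - lam * c) / TEMPERATURE for r, c in zip(REWARDS, COSTS)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def expected(values, probs):
    """Expectation of per-action values under the action probabilities."""
    return sum(v * p for v, p in zip(values, probs))


def dual_ascent(steps=500):
    """Alternate a (closed-form) policy step with a projected lambda step."""
    lam = 0.0
    for _ in range(steps):
        probs = softmax_policy(lam)
        cost = expected(COSTS, probs)
        # Raise lambda while cost exceeds the budget; never let it go negative.
        lam = max(0.0, lam + LR_LAMBDA * (cost - C_BUDGET))
    probs = softmax_policy(lam)
    return lam, expected(REWARDS, probs), expected(COSTS, probs)


lam, ret, cost = dual_ascent()
print(f"lambda={lam:.3f} return={ret:.3f} cost={cost:.3f}")
```

With these numbers the multiplier settles where the expected cost sits at the budget: the policy takes the risky action about 30% of the time instead of always.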
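# Lagrangian reward sketch: fold the cost into the reward signal
lagrangian_reward = r - lambda_cost * c
# Dual update: raise lambda when episode cost exceeds the budget, lower it otherwise.
# Clip at zero because the multiplier must stay non-negative.
lambda_cost = max(0.0, lambda_cost + lr_lambda * (episode_cost - C_budget))

Reward hacking and misspecification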
| Environment | Hack |
|---|---|
| Boat race (CoastRunners) | Circle to hit respawning targets, never finish |
| Grasping | Vibrate object for contact reward |
| Recommendation | Clickbait maximizing clicks not satisfaction |
| Chatbot RLHF proxy | Verbose flattering text, not truth |
Lesson: reward is a proxy. Monitor downstream metrics humans care about.
Checkpoint: Why doesn't a perfect simulator eliminate reward hacking?
Answer
Hacking exploits misspecified reward, not only sim error. Even with perfect physics, if reward counts "grasp contact" without lift, the policy hacks contact. Alignment fixes objective design and constraints, not just dynamics.
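The grasping case can be reproduced with a toy comparison. The sketch below uses hand-written traces, not output from any real simulator; the `Step` fields, the 0.10 m lift threshold, and both reward functions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    contact: bool      # gripper touches the object this step
    height_m: float    # object height above the table


def proxy_return(episode):
    """Misspecified reward: +1 for every step with contact."""
    return sum(1.0 for step in episode if step.contact)


def intended_return(episode, lift_threshold_m=0.10):
    """What the designer wanted: object held above a threshold at the end."""
    last = episode[-1]
    return 1.0 if last.contact and last.height_m >= lift_threshold_m else 0.0


# Hypothetical 20-step traces.
# "vibrate": constant contact, object never leaves the table.
vibrate = [Step(contact=True, height_m=0.0) for _ in range(20)]
# "lift": 5 approach steps without contact, then 15 steps raising the object.
lift = [Step(contact=False, height_m=0.0) for _ in range(5)] + [
    Step(contact=True, height_m=0.01 * k) for k in range(1, 16)
]

for name, episode in [("vibrate", vibrate), ("lift", lift)]:
    print(name, proxy_return(episode), intended_return(episode))
```

The proxy ranks the vibrating trace above the lifting one even though only the latter achieves the intended outcome, and nothing about the dynamics is wrong in either trace.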
Runtime safety layers
- Hard constraints — joint limits, max speed (clip before actuators).
- Shield — verify action against safety automaton; project to safe set.
- Uncertainty gate — if ensemble Q disagrees, fall back to safe policy.
- Human override — teleop takeover always available.
- Canary deploy — route about 1% of traffic to the new policy; widen the rollout only while cost metrics stay within budget, and roll back otherwise.
Never rely on the neural net alone for irreversible actions.
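The layers above can be composed as a wrapper around the policy output. This is a minimal sketch for a hypothetical one-dimensional velocity command; the limits, the wall position, the disagreement threshold, and all function names are assumptions, and a real shield would check against a verified safety model rather than a single inequality.

```python
import statistics

MAX_SPEED = 1.0            # hypothetical actuator limit
DISAGREEMENT_LIMIT = 0.5   # hypothetical threshold on ensemble std-dev
SAFE_ACTION = 0.0          # fallback command: stop


def clip_to_limits(action):
    """Hard constraint: clamp before the command reaches the actuator."""
    return max(-MAX_SPEED, min(MAX_SPEED, action))


def shield(position, action, wall=10.0, dt=0.1):
    """Project the action so one step of length dt cannot cross the wall."""
    max_forward = max((wall - position) / dt, 0.0)
    return min(action, max_forward)


def safe_action(position, policy_action, q_ensemble, human_override=None):
    """Apply override, uncertainty gate, hard limits, and shield in order."""
    if human_override is not None:
        # Human takeover wins, but still respects actuator limits.
        return clip_to_limits(human_override)
    if statistics.pstdev(q_ensemble) > DISAGREEMENT_LIMIT:
        # Ensemble members disagree: fall back to the safe action.
        return SAFE_ACTION
    return shield(position, clip_to_limits(policy_action))


print(safe_action(0.0, 5.0, [1.0, 1.1, 0.9]))    # clipped to the speed limit
print(safe_action(9.95, 1.0, [1.0, 1.0, 1.0]))   # reduced near the wall
print(safe_action(0.0, 0.5, [0.0, 2.0, 4.0]))    # gated by disagreement
```

Keeping these checks in plain, separately testable code is the point: none of them depends on the learned policy behaving well.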
Human feedback and RLHF (orientation)
For language and preference tasks: learn reward model from human rankings, optimize policy with PPO (RLHF). Risks: reward model overoptimization, distribution shift. Related to offline RL — data is logged preferences.
| Stage | Role |
|---|---|
| SFT | Behavior cloning on demonstrations |
| Reward model | Predict human preference |
| RL fine-tune | PPO with KL to SFT |
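The "PPO with KL to SFT" row is often implemented by shaping the reward before it reaches PPO. The sketch below shows one common formulation, assuming per-token log-probabilities from the current policy and the SFT reference are already available; the function name, the `beta` value, and the example numbers are illustrative, not taken from the lesson.

```python
def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for one sampled response.

    Each token is penalized by beta * (log pi(token) - log pi_ref(token)),
    a sample-based estimate of the KL term. The reward-model score for the
    whole response is added on the final token.
    """
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("need matching, non-empty log-probability lists")
    rewards = [-beta * (p - q) for p, q in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards


# Hypothetical numbers for a three-token response.
shaped = kl_shaped_rewards(
    rm_score=2.0,
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.2, -0.5, -3.0],
)
print(shaped)
```

Tokens the policy makes more likely than the reference does are taxed, so the policy has to earn enough reward-model score to justify drifting away from the SFT model.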
Fairness and robustness
- Subpopulation performance — eval per segment, not only average return.
- Adversarial observations — sensor attacks on policies.
- Explainability — log state features driving actions for audits.
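Per-segment evaluation is straightforward to sketch. The segment labels and returns below are hypothetical logged data, and the helper names are assumptions; the only point is that the average can look healthy while one group is served badly.

```python
from collections import defaultdict
from statistics import mean


def returns_by_segment(episodes):
    """Mean return per segment from (segment_label, episode_return) pairs."""
    grouped = defaultdict(list)
    for segment, episode_return in episodes:
        grouped[segment].append(episode_return)
    return {segment: mean(values) for segment, values in grouped.items()}


def worst_segment(episodes):
    """Return the (segment, mean_return) pair with the lowest mean return."""
    per_segment = returns_by_segment(episodes)
    return min(per_segment.items(), key=lambda item: item[1])


# Hypothetical logged evaluation episodes.
logged = [
    ("new_users", 0.2),
    ("new_users", 0.4),
    ("returning", 0.9),
    ("returning", 0.8),
    ("returning", 1.0),
]

overall = mean(ret for _, ret in logged)
print("overall:", round(overall, 2))
print("per segment:", returns_by_segment(logged))
print("worst:", worst_segment(logged))
```

Gating a release on the worst segment rather than on the overall mean is one simple way to keep a majority group from masking a failure.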
Deployment checklist
| Item | Pass criteria |
|---|---|
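| Simulation + off-policy evaluation (OPE) / offline eval | Beats the behavior-cloning (BC) baseline; estimated cost under budget |
| Stress tests | Passes under domain randomization (DR) and adversarial noise |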
| Latency budget | Inference ms within SLA |
| Rollback | Previous policy one click away |
| Incident runbook | Who kills the policy? |
| Compliance | Data retention, consent |
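A checklist like this can also be enforced as a release gate. The sketch below is one hypothetical encoding: the `EvalReport` fields, the use of p99 latency, and the check names are assumptions layered on the table, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    policy_return: float        # offline / OPE estimate for the candidate
    bc_return: float            # behavior-cloning baseline
    est_cost: float             # estimated expected cost
    cost_budget: float
    stress_tests_passed: bool   # domain randomization, adversarial noise
    p99_latency_ms: float       # assumed latency statistic
    latency_sla_ms: float
    rollback_ready: bool
    runbook_owner: str          # who is allowed to kill the policy
    compliance_signed_off: bool


def failed_checks(report):
    """Return the names of checklist items the candidate does not pass."""
    checks = {
        "beats BC baseline": report.policy_return > report.bc_return,
        "cost under budget": report.est_cost <= report.cost_budget,
        "stress tests": report.stress_tests_passed,
        "latency within SLA": report.p99_latency_ms <= report.latency_sla_ms,
        "rollback ready": report.rollback_ready,
        "incident owner named": bool(report.runbook_owner.strip()),
        "compliance": report.compliance_signed_off,
    }
    return [name for name, passed in checks.items() if not passed]


# Hypothetical candidate policy.
candidate = EvalReport(
    policy_return=0.72, bc_return=0.65,
    est_cost=0.25, cost_budget=0.30,
    stress_tests_passed=True,
    p99_latency_ms=18.0, latency_sla_ms=25.0,
    rollback_ready=True,
    runbook_owner="on-call ML engineer",
    compliance_signed_off=True,
)
print(failed_checks(candidate))
```

Returning the list of failed items, rather than a single boolean, makes the gate's output usable directly in a release log or incident review.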
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Reward = only proxy metric | User harm | Multi-objective + human eval |
| No cost constraint | Safety incidents | CMDP / shields |
| RLHF without KL penalty | Degenerate or gibberish text that games the reward model | KL penalty to the SFT reference policy |
| Skip shadow mode | Surprise prod failures | Parallel logging |
| No post-deploy monitoring | Slow drift detection | Lesson 5 metrics |
Closing
Safe deployment treats the policy as one component in a constrained, monitored system. Specify rewards skeptically, enforce hard limits outside learning, and keep humans in the loop for high-stakes domains. Final lesson: metrics and evaluation that run continuously in production.