Safety, alignment & deployment

Before we begin

Deploying RL is not the same as solving a benchmark. Policies can maximize reward while violating constraints — unsafe torques, biased recommendations, reward hacking. Safe RL and alignment aim to optimize task performance subject to safety, fairness, and human intent. Production needs guardrails outside the learned policy.

Constrained MDP — maximize return subject to cost ≤ budget.
Reward hacking — policy exploits misspecified reward.
Shield / filter — override unsafe actions at runtime.

What you will learn

Formulate CMDPs with cost functions and Lagrangian methods.
List reward hacking examples and specification pitfalls.
Apply action shields, human-in-the-loop, and kill switches.
Connect alignment to offline conservative RL and monitoring.
Draft a deployment readiness checklist for RL services.

Constrained MDPs

Standard MDP maximizes E[Σ γᵗ rₜ]. CMDP adds costs cₜ:

maximize return subject to E[Σ γᵗ cₜ] ≤ C
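The constraint is on an expectation, so in practice it is checked with a sample average over episodes. A minimal sketch, assuming per-step rewards and costs are logged as plain lists (the function names, the two episodes, and the budget value are illustrative assumptions):

```python
from typing import List, Sequence, Tuple


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum over t of gamma**t * values[t] for one episode."""
    total = 0.0
    for t, v in enumerate(values):
        total += (gamma ** t) * v
    return total


def estimate_cmdp_terms(
    episodes: List[Tuple[Sequence[float], Sequence[float]]], gamma: float
) -> Tuple[float, float]:
    """Monte Carlo estimates of expected discounted return and cost.

    `episodes` holds one (rewards, costs) pair per episode.
    """
    returns = [discounted_sum(rewards, gamma) for rewards, _ in episodes]
    costs = [discounted_sum(costs_, gamma) for _, costs_ in episodes]
    return sum(returns) / len(returns), sum(costs) / len(costs)


# Two hypothetical episodes: (per-step rewards, per-step costs).
episodes = [
    ([1.0, 1.0, 1.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 2.0], [0.0, 0.0, 0.0]),
]
avg_return, avg_cost = estimate_cmdp_terms(episodes, gamma=0.5)
budget = 0.3  # C in the constraint (illustrative value)
feasible = avg_cost <= budget
```

With few episodes the cost estimate is noisy, so a policy that looks feasible here may still exceed the budget in expectation.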

Approach	Mechanism
Lagrangian	Optimize return − λ(cost − C); learn λ
Primal-dual	Alternate policy and λ updates
CPO / TRPO variants	Trust region with cost constraints
Reward shaping penalty	Add −λc to reward with a fixed, hand-tuned λ (approximate; the budget is not enforced)

python

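# Illustrative values (assumptions, not from a real run)
r, c, lambda_cost, lr_lambda = 1.0, 0.2, 0.5, 0.01
episode_cost, C_budget = 3.0, 2.5
# Lagrangian reward sketch: task reward minus the weighted cost
lagrangian_reward = r - lambda_cost * c
# Dual ascent on lambda, clipped at 0 so the multiplier stays non-negative
lambda_cost = max(0.0, lambda_cost + lr_lambda * (episode_cost - C_budget))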
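To see how the multiplier behaves over several episodes, here is a small sketch of the same dual update applied to a fixed list of episode costs. The costs, budget, and step size are made-up values; in a real training loop the costs would change as the policy reacts to λ.

```python
def dual_ascent_step(lam: float, episode_cost: float, budget: float, lr: float) -> float:
    """One projected dual-ascent step on the Lagrange multiplier."""
    return max(0.0, lam + lr * (episode_cost - budget))


budget = 2.0
lr = 0.5
lam = 0.0
history = []
# Hypothetical episode costs: over budget first, then under budget.
for episode_cost in [4.0, 3.0, 2.0, 1.0, 0.0, 0.0]:
    lam = dual_ascent_step(lam, episode_cost, budget, lr)
    history.append(lam)
```

The multiplier grows while cost exceeds the budget, shrinks once cost falls below it, and stops at zero rather than turning negative (a negative λ would reward cost).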

Reward hacking and misspecification

Environment	Hack
Boat race (CoastRunners)	Loop to hit respawning targets, never finish
Grasping	Vibrate object for contact reward
Recommendation	Clickbait maximizing clicks not satisfaction
Chatbot RLHF proxy	Verbose flattering text, not truth

Lesson: reward is a proxy. Monitor downstream metrics humans care about.

Checkpoint: Why doesn't a perfect simulator eliminate reward hacking?

Answer

Hacking exploits misspecified reward, not only sim error. Even with perfect physics, if reward counts "grasp contact" without lift, the policy hacks contact. Alignment fixes objective design and constraints, not just dynamics.
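The grasping case can be made concrete with a toy comparison. This is a sketch under stated assumptions: the trajectory format, the 0.10 m lift threshold, and both rollouts are hypothetical, and the "physics" is identical for both reward functions.

```python
def contact_reward(trajectory):
    """Misspecified proxy: +1 for every step with gripper-object contact."""
    return sum(1.0 for step in trajectory if step["contact"])


def lift_reward(trajectory, lift_height=0.10):
    """Closer to intent: +1 only if the object ends at or above lift_height (metres)."""
    return 1.0 if trajectory[-1]["height"] >= lift_height else 0.0


# Hypothetical 6-step rollouts.
vibrate = [{"contact": True, "height": 0.0} for _ in range(6)]
lift = [
    {"contact": False, "height": 0.0},
    {"contact": False, "height": 0.0},
    {"contact": True, "height": 0.0},
    {"contact": True, "height": 0.05},
    {"contact": True, "height": 0.10},
    {"contact": True, "height": 0.15},
]

proxy_prefers_hack = contact_reward(vibrate) > contact_reward(lift)
intent_prefers_lift = lift_reward(lift) > lift_reward(vibrate)
```

The proxy ranks the vibrating rollout above the real lift, while the lift-based reward ranks them the other way. Nothing about the simulator changed between the two.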

Runtime safety layers

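Hard constraints: joint limits and max speed, enforced by clipping just before the actuators.
Shield: check each action against a safety automaton; project unsafe actions onto the safe set.
Uncertainty gate: if the Q-ensemble members disagree beyond a threshold, fall back to a safe policy.
Human override: teleop takeover always available.
Canary deploy: route 1% of traffic to the new policy; widen the rollout only while cost metrics stay within budget, otherwise roll back.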

Never rely on the neural net alone for irreversible actions.
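Two of the layers above, the uncertainty gate and the hard clip, can be sketched for a scalar action as follows. The function name, the use of population standard deviation as the disagreement measure, and the threshold are assumptions for illustration; a full shield would also check the action against a safety model.

```python
from statistics import pstdev


def clip(value, low, high):
    """Clamp value into [low, high]."""
    return max(low, min(high, value))


def guarded_action(policy_action, q_estimates, fallback_action,
                   action_low, action_high, max_disagreement):
    """Apply an uncertainty gate, then a hard clip, to a scalar action.

    q_estimates: Q-values for policy_action from each ensemble member.
    Returns (action, source), where source records whether the policy
    or the fallback supplied the action.
    """
    # Uncertainty gate: a large ensemble spread means the policy is not trusted here.
    if pstdev(q_estimates) > max_disagreement:
        action, source = fallback_action, "fallback"
    else:
        action, source = policy_action, "policy"
    # Hard constraint: always clip last, right before the actuator.
    return clip(action, action_low, action_high), source


confident = guarded_action(0.8, [1.0, 1.0, 1.0], 0.0, -0.5, 0.5, 0.2)
uncertain = guarded_action(0.3, [0.0, 2.0], 0.0, -0.5, 0.5, 0.2)
```

The clip runs after the gate on purpose, so the fallback action is held to the same actuator limits as the learned policy.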

Human feedback and RLHF (orientation)

For language and preference tasks, RLHF learns a reward model from human rankings, then optimizes the policy against it with PPO. Risks: reward model overoptimization and distribution shift. The setup is related to offline RL because the reward model is trained on logged preference data.

Stage	Role
SFT	Behavior cloning on demonstrations
Reward model	Predict human preference
RL fine-tune	PPO with a KL penalty toward the SFT policy
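The last two stages can be sketched numerically. This assumes a Bradley-Terry style pairwise loss for the reward model (a common choice, not confirmed by this lesson) and a single-sample log-ratio as the KL term; the function names and the coefficient beta are illustrative.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log sigmoid(r_chosen - r_rejected), computed stably."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) rewritten to avoid overflow for large |margin|
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def kl_shaped_reward(rm_score, logp_policy, logp_reference, beta):
    """Reward-model score minus a KL-style penalty toward the reference (SFT) policy."""
    return rm_score - beta * (logp_policy - logp_reference)


loss_tied = preference_loss(0.0, 0.0)      # reward model cannot tell the pair apart
loss_correct = preference_loss(2.0, 0.0)   # chosen response scored higher
loss_wrong = preference_loss(0.0, 2.0)     # rejected response scored higher
shaped = kl_shaped_reward(rm_score=1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5)
```

The penalty lowers the reward whenever the policy assigns more probability to its output than the SFT policy did, which is what discourages drifting toward text that only the reward model likes.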

Fairness and robustness

Subpopulation performance — eval per segment, not only average return.
Adversarial observations — sensor attacks on policies.
Explainability — log state features driving actions for audits.
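Per-segment evaluation can be as simple as grouping episode returns before averaging. A minimal sketch, assuming each logged episode carries a segment label (the segment names and numbers are made up):

```python
from collections import defaultdict


def per_segment_returns(records):
    """Mean return per segment from (segment, episode_return) pairs."""
    grouped = defaultdict(list)
    for segment, episode_return in records:
        grouped[segment].append(episode_return)
    return {seg: sum(vals) / len(vals) for seg, vals in grouped.items()}


records = [
    ("new_users", 2.0), ("new_users", 4.0),
    ("returning", 9.0), ("returning", 11.0),
]
means = per_segment_returns(records)
overall = sum(ret for _, ret in records) / len(records)
worst_segment = min(means, key=means.get)
```

Here the overall average looks healthy while one segment does far worse, which is the case an average-only report hides.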

Deployment checklist

Item	Pass criteria
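Sim + OPE / offline eval	Beats the behavior cloning (BC) baseline in simulation and off-policy evaluation (OPE); cost under budget
Stress tests	Passes under domain randomization (DR) and adversarial noise
Latency budget	Inference latency (ms) within SLA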
Rollback	Previous policy one click away
Incident runbook	Names who can kill the policy, and how
Compliance	Data retention, consent
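One way to keep the checklist from being skipped is to encode it as a gate in the release script. This is a hypothetical sketch: the item keys mirror the table rows, and the pass/fail values would come from your own evaluation jobs.

```python
def readiness_report(checks):
    """Return (ready, failing_items) from a mapping of item name to passed flag."""
    failing = [item for item, passed in checks.items() if not passed]
    return len(failing) == 0, failing


# Illustrative results; in practice each flag is produced by a separate check.
checks = {
    "offline_eval_beats_bc": True,
    "cost_under_budget": True,
    "stress_tests": True,
    "latency_within_sla": False,
    "rollback_ready": True,
    "incident_runbook": True,
    "compliance": True,
}
ready, failing = readiness_report(checks)
```

A single failing item blocks the release, and the report says which one.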

Common mistakes

Mistake	Symptom	Fix
Reward = only proxy metric	User harm	Multi-objective + human eval
No cost constraint	Safety incidents	CMDP / shields
RLHF without KL regularization	Policy drifts into gibberish that exploits the reward model	KL penalty toward the reference policy
Skip shadow mode	Surprise prod failures	Run in shadow mode with parallel logging first
No post-deploy monitoring	Slow drift detection	Lesson 5 metrics

Closing

Safe deployment treats the policy as one component in a constrained, monitored system. Specify rewards skeptically, enforce hard limits outside learning, and keep humans in the loop for high-stakes domains. Final lesson: metrics and evaluation that run continuously in production.

What's next

Next lesson — Monitoring & evaluation in production