Offline RL & batch constraints
Before we begin
Many production systems cannot run exploratory RL online — recommender systems, medical dosing, and industrial controllers must learn from historical logs collected by a legacy policy. Offline RL (batch RL) estimates a better policy from fixed data without new environment interaction. Done wrong, it overestimates Q for actions never seen in the dataset — catastrophically in deployment.
Offline RL: train only from a static dataset D = {(s, a, r, s′)} of logged transitions.
Distribution shift: the learned policy queries Q at actions that are absent from the data.
Conservative methods: penalize out-of-distribution (OOD) actions (CQL, IQL, BC regularization).
What you will learn
- Contrast online, off-policy with replay, and pure offline RL.
- Explain extrapolation error in Q-learning on fixed data.
- Name CQL, IQL, BCQ and their core idea in one sentence each.
- Apply behavior cloning baselines and recognize when they win.
- Design datasets and evaluation for batch settings (OPE).
Why offline RL is hard
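Off-policy DQN learns from a replay buffer but keeps collecting new data, so an overestimated action eventually gets tried and its value corrected. Offline, the dataset is fixed. The critic evaluates actions π(s) that may never appear in D, and a neural Q-function tends to extrapolate optimistically there, with nothing to correct it.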
| Setting | New data? | Risk |
|---|---|---|
| Online RL | Yes | Exploration can be costly or unsafe |
| Off-policy + env | Yes | Moderate |
| Offline only | No | Q blow-up on OOD actions |
The deadly triad (function approximation, bootstrapping, off-policy learning) is worse offline: there are no corrective visits to undo an overestimated value.
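The way one bad extrapolation spreads through bootstrapping can be shown with a toy example. The sketch below assumes a hypothetical one-state MDP in which the logs only ever contain action 0, and it stands in for the function approximator's error by fixing the unseen action's value at an arbitrary 50. None of these numbers come from a real system.

```python
GAMMA = 0.9

# Hypothetical logs: one state, only action 0 is ever taken (reward 1, loops back).
dataset = [(0, 0, 1.0, 0)] * 50  # (s, a, r, s_next)

def fit_q(dataset, use_max, unseen_value=50.0, sweeps=200, lr=0.5):
    """Tabular TD on fixed data; Q(s, 1) is never updated because it is never logged."""
    q = {(0, 0): 0.0, (0, 1): unseen_value}
    for _ in range(sweeps):
        for s, a, r, s_next in dataset:
            if use_max:
                # Standard Q-learning target: the max may pick the unseen action.
                bootstrap = max(q[(s_next, b)] for b in (0, 1))
            else:
                # In-sample target: bootstrap only from the logged action (always 0 here).
                bootstrap = q[(s_next, 0)]
            q[(s, a)] += lr * (r + GAMMA * bootstrap - q[(s, a)])
    return q

q_max = fit_q(dataset, use_max=True)
q_in_sample = fit_q(dataset, use_max=False)
print(round(q_max[(0, 0)], 3), round(q_in_sample[(0, 0)], 3))
```

With the max in the target, the logged action's value converges to 1 + 0.9 × 50 = 46, inheriting the error of an action that was never observed. The in-sample target converges to 1 / (1 − 0.9) = 10, which is the value actually supported by the data.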
Behavior cloning baseline
Supervised learning: fit π(a|s) to the (s, a) pairs in the logs. Works when the data is expert and covers the states you will see in deployment.
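```python
import torch.nn.functional as F

# BC sketch: often a strong baseline (policy, optimizer, dataset assumed defined)
for s, a in dataset:
    optimizer.zero_grad()
    loss = F.mse_loss(policy(s), a)  # or F.cross_entropy for discrete actions
    loss.backward()
    optimizer.step()
```

| BC strength | BC weakness |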
|---|---|
| Stable, simple | Cannot improve beyond data |
| Good for narrow tasks | Covariate shift — errors compound |
DAgger (interactive) queries expert after visiting new states — not pure offline but bridges the gap.
Conservative Q-Learning (CQL) — intuition
Add penalty pushing down Q on out-of-distribution actions while fitting Bellman on data actions:
Loss = Bellman error + α ( E_{a∼μ}[Q(s,a)] − E_{a∼data}[Q(s,a)] )
Here μ is a broad proposal distribution over actions, while the second expectation runs only over actions that appear in the logs. Result: Q is pessimistic away from the data, so the policy prefers actions similar to the logging policy when uncertain.
Worked intuition
Dataset only contains slow driving actions. Standard Q might assign huge Q to unseen "floor it" action. CQL suppresses those Q values → learned policy stays near safe behaviors unless data proves high reward elsewhere.
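The driving intuition can be sketched numerically. This is a simplified illustration, not the full CQL algorithm: it assumes a single state, a one-step problem (no bootstrapping), and a uniform μ over three hypothetical actions, whereas published CQL variants typically use a log-sum-exp over actions. All values and names are made up for the example.

```python
def cql_penalty(q_row, data_action):
    """E_{a~mu}[Q(s,a)] - Q(s, a_data), with mu uniform over actions (an assumption)."""
    return sum(q_row) / len(q_row) - q_row[data_action]

def cql_step(q_row, data_action, reward, alpha=1.0, lr=0.1):
    """One gradient step on 0.5 * (Q(s, a_data) - reward)^2 + alpha * penalty."""
    n = len(q_row)
    new_row = []
    for a, q in enumerate(q_row):
        grad = alpha / n                  # E_mu term pushes every action down
        if a == data_action:
            grad += (q - reward) - alpha  # fit the logged reward; E_data term pushes up
        new_row.append(q - lr * grad)
    return new_row

# Actions: 0 = slow (logged), 1 = medium, 2 = floor it (both never logged).
q_row = [1.0, 8.0, 8.0]  # unseen actions start with inflated values
penalty_before = cql_penalty(q_row, data_action=0)
for _ in range(300):
    q_row = cql_step(q_row, data_action=0, reward=1.0)

greedy_action = max(range(len(q_row)), key=lambda a: q_row[a])
print(q_row, greedy_action)
```

After training, the logged action has the highest value and the greedy policy picks it. With this linear form of the penalty the unseen values keep drifting down for as long as training continues, which is one reason α needs tuning.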
Checkpoint: When does offline RL offer no benefit over BC?
Answer
When data is uniformly expert, task is narrow, and you cannot deploy improvements outside the data support — BC matches performance with less risk. Offline RL helps when data is suboptimal but diverse enough to infer better trade-offs.
Other algorithms (one-liners)
| Algorithm | Core idea |
|---|---|
| IQL | Expectile regression of V toward Q on dataset actions; avoids the max over OOD actions |
| BCQ | Sample actions from a generative model of the behavior policy and perturb them slightly; stay near the data support |
| TD3+BC | TD3 + BC penalty toward dataset actions |
| Decision Transformer | Sequence modeling; condition on desired return |
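
The expectile regression named in the IQL row can be sketched on its own. In this minimal example a single scalar `v` stands in for a state value, and a hypothetical list of numbers stands in for Q-values of logged actions at that state. The asymmetric squared loss weights positive errors by τ and negative errors by 1 − τ.

```python
def expectile(values, tau, lr=0.1, steps=2000):
    """Minimize mean(|tau - 1(u < 0)| * u^2), u = q - v, by gradient descent on v."""
    v = sum(values) / len(values)
    for _ in range(steps):
        grad = 0.0
        for q in values:
            u = q - v
            weight = tau if u > 0 else 1.0 - tau
            grad += -2.0 * weight * u  # derivative of weight * u^2 with respect to v
        v -= lr * grad / len(values)
    return v

q_values = [0.0, 1.0, 2.0, 10.0]  # hypothetical Q-values of logged actions
v_mean = expectile(q_values, tau=0.5)
v_high = expectile(q_values, tau=0.9)
print(v_mean, v_high)
```

With τ = 0.5 the result is the ordinary mean. Raising τ moves the estimate toward the best logged action while still using only values seen in the data, which is how a max over unseen actions is avoided.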
Dataset design matters
| Property | Why |
|---|---|
| Coverage | Diverse (s,a) — not one trajectory |
| Quality mix | Some good episodes to learn from |
| Metadata | Logging policy version, time, context |
| No leakage | Train/val split by time or user |
| Safety labels | Near-miss flags for constraints |
Garbage logs → garbage policy, offline or not.
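The "No leakage" row can be made concrete with a small split helper. This is a sketch under assumptions: the record fields (`user_id`, `timestamp`, `policy_version`) are hypothetical names chosen for illustration, not a required log schema.

```python
def temporal_split(records, cutoff):
    """Train on records logged before `cutoff`; validate on the rest."""
    train = [r for r in records if r["timestamp"] < cutoff]
    val = [r for r in records if r["timestamp"] >= cutoff]
    return train, val

def drop_seen_users(train, val):
    """Optionally keep only validation records from users absent in training."""
    train_users = {r["user_id"] for r in train}
    return [r for r in val if r["user_id"] not in train_users]

# Hypothetical log records; field names are illustrative only.
logs = [
    {"user_id": "u1", "timestamp": 1, "policy_version": "v1"},
    {"user_id": "u2", "timestamp": 2, "policy_version": "v1"},
    {"user_id": "u1", "timestamp": 3, "policy_version": "v2"},
    {"user_id": "u3", "timestamp": 4, "policy_version": "v2"},
]
train, val = temporal_split(logs, cutoff=3)
val_unseen_users = drop_seen_users(train, val)
print(len(train), len(val), [r["user_id"] for r in val_unseen_users])
```

Splitting by time mimics deployment, where the policy always acts on data newer than its training set. Filtering by user is stricter and shrinks the validation set, so which one to use depends on how much data is available.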
Offline policy evaluation (OPE)
Before deploy, estimate policy value from logs:
- Importance sampling — needs behavior policy π_b known; high variance.
- Fitted Q evaluation — learn Q on data, evaluate π.
- Doubly robust — combines model + IS.
Run A/B or shadow mode when possible — OPE is approximate.
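The importance sampling estimator from the list above can be sketched in a few lines. The example assumes each logged step stores the probability the evaluated policy and the behavior policy assign to the logged action, and that the behavior probability is nonzero. The data layout and function names are hypothetical.

```python
def trajectory_weight_and_return(trajectory, gamma=1.0):
    """Product of per-step probability ratios, and the discounted return."""
    weight, ret = 1.0, 0.0
    for t, (pi_e, pi_b, reward) in enumerate(trajectory):
        weight *= pi_e / pi_b  # requires pi_b > 0 for every logged action
        ret += (gamma ** t) * reward
    return weight, ret

def is_estimates(trajectories, gamma=1.0):
    """Return (ordinary IS, weighted IS) estimates of the evaluated policy's value."""
    pairs = [trajectory_weight_and_return(tr, gamma) for tr in trajectories]
    weighted_sum = sum(w * g for w, g in pairs)
    ordinary = weighted_sum / len(pairs)
    weighted = weighted_sum / sum(w for w, _ in pairs)
    return ordinary, weighted

# Hypothetical one-step logs: (pi_e(a|s), pi_b(a|s), reward) per step.
logged = [
    [(0.8, 0.5, 1.0)],  # rewarded action, favored by the evaluated policy
    [(0.2, 0.5, 0.0)],  # unrewarded action
]
ordinary_is, weighted_is = is_estimates(logged)
print(ordinary_is, weighted_is)
```

Both estimates here are 0.8, matching the evaluated policy's true value in this tiny example (it picks the rewarded action with probability 0.8). On long trajectories the product of ratios can become very large or very small, which is the high variance noted above; the weighted variant trades a small bias for lower variance.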
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Standard DQN offline | Q explodes | CQL / IQL / BC mix |
| No behavior policy info | Bad OPE | Log π_b or propensities |
| Train on test users | Inflated metrics | Temporal split |
| Ignoring BC baseline | Complex failure | Always compare BC |
| Deploy without constraints | Safety incidents | Filters + human review |
Closing
Offline RL is how RL meets logged production data. Treat extrapolation as the enemy; use conservative objectives and strong BC baselines. Exploration and multi-agent topics next broaden the problem class; safety and monitoring close the loop on responsible deployment.