Offline RL & batch constraints

Before we begin

Many production systems cannot run exploratory RL online: recommender systems, medical dosing, and industrial controllers must learn from historical logs collected by a legacy policy. Offline RL (batch RL) estimates a better policy from fixed data without new environment interaction. Done carelessly, it overestimates Q for actions never seen in the dataset, and the resulting policy can fail catastrophically in deployment.

Offline RL: train only from a static dataset D = {(s, a, r, s′)} of logged transitions.
Distribution shift: the learned policy queries Q at actions that are not in the data.
Conservative methods: penalize or avoid out-of-distribution (OOD) actions (CQL, IQL, BC regularization).

What you will learn

Contrast online, off-policy with replay, and pure offline RL.
Explain extrapolation error in Q-learning on fixed data.
Name CQL, IQL, BCQ and their core idea in one sentence each.
Apply behavior cloning baselines and recognize when they win.
Design datasets and evaluation for batch settings, including offline policy evaluation (OPE).

Why offline RL is hard

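Off-policy DQN learns from a replay buffer but keeps collecting new data, so when the critic overestimates an action the agent tries it and the new transitions correct the error. In the offline setting the dataset is fixed. The critic must evaluate actions π(s) that may never appear in D, and a neural Q-function tends to extrapolate optimistically there, because the max in the Bellman target picks out whichever estimates are most inflated.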

Setting	New data?	Risk
Online RL	Yes	Cost of unsafe exploration
Off-policy + env	Yes	Moderate; new data corrects Q errors
Offline only	No	Q blow-up on OOD actions

The deadly triad (function approximation, bootstrapping, off-policy learning) gets worse offline: there are no corrective visits to the states and actions where Q is wrong.
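The extrapolation problem can be seen in a toy setting. The sketch below is only an illustration under stated assumptions: a one-step problem with a made-up value function `true_q`, logged actions restricted to one side of the action range, and a linear fit standing in for a neural critic. None of these names come from the lesson.

```python
import numpy as np

# Hypothetical one-step problem: the true value peaks at a = 0.
def true_q(a):
    return -np.square(a)

# Logged actions cover only [-0.5, 0]; the dataset never contains a > 0.
logged_actions = np.linspace(-0.5, 0.0, 11)
logged_returns = true_q(logged_actions)

# Fit a linear Q model to the logs (a stand-in for any function approximator).
slope, intercept = np.polyfit(logged_actions, logged_returns, deg=1)

def fitted_q(a):
    return slope * a + intercept

# Greedy improvement searches the whole action range, including unseen actions.
candidate_actions = np.linspace(-1.0, 1.0, 21)
greedy_action = candidate_actions[np.argmax(fitted_q(candidate_actions))]

print("greedy action:", greedy_action)
print("fitted Q at greedy action:", fitted_q(greedy_action))
print("true Q at greedy action:", true_q(greedy_action))
```

Within the logged range the fit is reasonable, but the upward trend it learned there continues past the data, so the greedy step selects the action farthest from anything in the logs, where the true value is lowest.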

Behavior cloning baseline

Supervised learning: fit π(a|s) to the (s, a) pairs in the logs. Works when the data is expert-quality and covers the states the policy will see at deployment.

python

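import torch.nn.functional as F

# BC sketch: often a strong baseline. Assumes policy, optimizer, and dataset are defined elsewhere.
for s, a in dataset:
    loss = F.mse_loss(policy(s), a)  # use F.cross_entropy for discrete actions
    optimizer.zero_grad()
    loss.backward()  # backpropagate inside the loop, once per batch
    optimizer.step()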

BC strength	BC weakness
Stable, simple	Cannot improve beyond data
Good for narrow tasks	Covariate shift — errors compound
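For small discrete problems, the cross-entropy version of BC has a simple closed form: the maximum-likelihood policy is the empirical action frequency in each logged state. The sketch below assumes hashable states and actions; the function name `fit_tabular_bc` and the example logs are hypothetical.

```python
from collections import Counter, defaultdict

def fit_tabular_bc(pairs):
    """Estimate pi(a|s) as the empirical action frequency in each logged state."""
    counts = defaultdict(Counter)
    for state, action in pairs:
        counts[state][action] += 1
    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: n / total for a, n in action_counts.items()}
    return policy

# Hypothetical logs: (state, action) pairs from a legacy controller.
logs = [("low", "heat"), ("low", "heat"), ("low", "idle"), ("high", "cool")]
bc_policy = fit_tabular_bc(logs)

print(bc_policy["low"])
print(bc_policy.get("unseen"))  # None: BC has no estimate for states outside the logs
```

The missing entry for an unseen state is the tabular face of covariate shift: once the cloned policy drifts to states the logs do not cover, it has nothing to imitate.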

DAgger (interactive) queries expert after visiting new states — not pure offline but bridges the gap.

Conservative Q-Learning (CQL) — intuition

Add penalty pushing down Q on out-of-distribution actions while fitting Bellman on data actions:

Loss = Bellman error + α ( E_{a∼μ}[Q(s,a)] − E_{a∼data}[Q(s,a)] )

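Here μ is a distribution that spreads probability broadly over actions, including ones the logs never contain, while the data expectation uses only logged actions. The result is a pessimistic Q away from the data: when uncertain, the policy prefers actions similar to those of the logging policy.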
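For discrete actions, one common way to instantiate the penalty is to choose μ so that the first expectation becomes a log-sum-exp over all actions, which gives logsumexp_a Q(s,a) − Q(s, a_data) per sample. The NumPy sketch below computes only that penalty term, under that assumption; the Bellman error and α are left out, and the function names and toy numbers are illustrative.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def cql_penalty(q_values, data_actions):
    """Mean of logsumexp_a Q(s, a) - Q(s, a_data) over a batch.

    q_values: array of shape (batch, num_actions).
    data_actions: integer array of shape (batch,), the logged action per state.
    """
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return float(np.mean(logsumexp(q_values, axis=1) - q_data))

# Toy batch: action 0 is logged, action 1 never appears in the data.
data_actions = np.array([0, 0])
optimistic_q = np.array([[1.0, 10.0], [1.0, 10.0]])    # large Q on the unseen action
conservative_q = np.array([[1.0, -5.0], [1.0, -5.0]])  # unseen action pushed down

penalty_optimistic = cql_penalty(optimistic_q, data_actions)
penalty_conservative = cql_penalty(conservative_q, data_actions)
print(penalty_optimistic, penalty_conservative)
```

The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value. It shrinks toward zero only when the logged action dominates, which is why minimizing it pushes down Q on unseen actions.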

Worked intuition

Suppose the dataset contains only slow driving actions. Standard Q-learning might assign a huge Q-value to the unseen "floor it" action. CQL suppresses those Q-values, so the learned policy stays near safe behaviors unless the data proves high reward elsewhere.

Checkpoint: When does offline RL offer no benefit over BC?

Answer

When data is uniformly expert, task is narrow, and you cannot deploy improvements outside the data support — BC matches performance with less risk. Offline RL helps when data is suboptimal but diverse enough to infer better trade-offs.

Other algorithms (one-liners)

Algorithm	Core idea
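IQL	Expectile regression fits V to an upper expectile of Q over dataset actions; never takes a max over OOD actions
BCQ	Generative model proposes actions like the behavior policy, then perturbs them slightly; stays in the data hull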
TD3+BC	TD3 + BC penalty toward dataset actions
Decision Transformer	Sequence modeling; condition on desired return
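Two of the one-liners above reduce to short loss functions. The NumPy sketch below shows an expectile loss of the kind IQL uses and an actor objective in the style of TD3+BC. It only evaluates the objectives on arrays and is not a training loop; the function names, `tau=0.7`, and `alpha=2.5` are illustrative assumptions, not values from this lesson.

```python
import numpy as np

def expectile_loss(q_target, v_pred, tau=0.7):
    """Asymmetric squared error that pulls V toward an upper expectile of Q.

    Errors where Q exceeds V are weighted by tau, the rest by (1 - tau).
    """
    diff = q_target - v_pred
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

def td3_bc_actor_loss(q_of_policy_actions, policy_actions, data_actions, alpha=2.5):
    """TD3+BC-style actor objective: maximize Q while staying near dataset actions."""
    # Scale the Q term by its average magnitude so both terms stay comparable.
    lam = alpha / (np.mean(np.abs(q_of_policy_actions)) + 1e-8)
    bc_term = np.mean((policy_actions - data_actions) ** 2)
    return float(-lam * np.mean(q_of_policy_actions) + bc_term)

v = np.array([2.0, 2.0])
loss_when_v_too_low = expectile_loss(np.array([3.0, 3.0]), v, tau=0.9)
loss_when_v_too_high = expectile_loss(np.array([1.0, 1.0]), v, tau=0.9)
print(loss_when_v_too_low, loss_when_v_too_high)

q = np.array([2.0, 2.0])
logged = np.array([0.0, 0.0])
loss_on_data = td3_bc_actor_loss(q, logged, logged)
loss_off_data = td3_bc_actor_loss(q, logged + 1.0, logged)
print(loss_on_data, loss_off_data)
```

With `tau` above 0.5, underestimating Q costs more than overestimating it, so V drifts toward the better logged actions without ever querying unlogged ones. In the actor objective, moving away from the logged action adds a squared-distance cost that the Q term has to outweigh.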

Dataset design matters

Property	Why
Coverage	Diverse (s,a) — not one trajectory
Quality mix	Some good episodes to learn from
Metadata	Logging policy version, time, context
No leakage	Train/val split by time or user
Safety labels	Near-miss flags for constraints

Garbage logs → garbage policy, offline or not.
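The "No leakage" row can be sketched with two small helpers, one splitting by user and one by time. The record fields (`user`, `t`) and the function names are hypothetical; real logs will have their own schema.

```python
def split_by_user(records, holdout_users):
    """Put every record from a held-out user in validation, the rest in training."""
    train = [r for r in records if r["user"] not in holdout_users]
    val = [r for r in records if r["user"] in holdout_users]
    return train, val

def split_by_time(records, cutoff):
    """Train on records logged before the cutoff, validate on the rest."""
    train = [r for r in records if r["t"] < cutoff]
    val = [r for r in records if r["t"] >= cutoff]
    return train, val

# Hypothetical log rows; the field names are assumptions.
records = [
    {"user": "u1", "t": 1},
    {"user": "u1", "t": 5},
    {"user": "u2", "t": 2},
    {"user": "u3", "t": 6},
]

train_u, val_u = split_by_user(records, holdout_users={"u3"})
train_t, val_t = split_by_time(records, cutoff=5)
print(len(train_u), len(val_u), len(train_t), len(val_t))
```

A time split alone can still place the same user on both sides (user `u1` here), so which split is appropriate depends on whether the deployed policy will face new users, later time periods, or both.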

Offline policy evaluation (OPE)

Before deploy, estimate policy value from logs:

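Importance sampling: reweights logged returns by the ratio π/π_b; needs the behavior policy π_b (or logged propensities) and has high variance.
Fitted Q evaluation: learn Q for π on the data by regression, then evaluate π with it.
Doubly robust: combines a learned value model with importance sampling; stays accurate if either component is good.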

Run A/B or shadow mode when possible — OPE is approximate.
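As a concrete instance of the importance sampling estimator listed above, the sketch below computes ordinary and weighted (self-normalized) trajectory-level estimates. It assumes each logged step stores the probability of the logged action under both policies and that π_b is nonzero wherever it was logged; the data layout and function name are hypothetical.

```python
import numpy as np

def importance_sampling_estimates(trajectories, gamma=1.0):
    """Return (ordinary IS, weighted IS) estimates of the evaluation policy's value.

    Each trajectory is a list of (pi_e_prob, pi_b_prob, reward) tuples, where the
    probabilities are those of the logged action under the evaluation policy and
    the behavior policy respectively.
    """
    weights, returns = [], []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (pi_e, pi_b, reward) in enumerate(traj):
            weight *= pi_e / pi_b          # cumulative likelihood ratio
            ret += (gamma ** t) * reward   # discounted return of the logged trajectory
        weights.append(weight)
        returns.append(ret)
    weights, returns = np.array(weights), np.array(returns)
    ordinary = float(np.mean(weights * returns))
    weighted = float(np.sum(weights * returns) / np.sum(weights))
    return ordinary, weighted

# One-step toy logs: the behavior policy picks each of two actions with probability 0.5.
# The evaluation policy picks the rewarded action with probability 0.9.
trajs = [
    [(0.9, 0.5, 1.0)],  # rewarded action was logged
    [(0.1, 0.5, 0.0)],  # unrewarded action was logged
]
ordinary_is, weighted_is = importance_sampling_estimates(trajs)
print(ordinary_is, weighted_is)
```

In this toy case both estimates recover the evaluation policy's expected reward of 0.9. On long trajectories the product of ratios can grow or vanish quickly, which is the source of the high variance noted above.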

Common mistakes

Mistake	Symptom	Fix
Standard DQN offline	Q explodes	CQL / IQL / BC mix
No behavior policy info	Bad OPE	Log π_b or propensities
Train on test users	Inflated metrics	Split by time or user
Ignoring BC baseline	Complex failure	Always compare BC
Deploy without constraints	Safety incidents	Filters + human review

Closing

Offline RL is how RL meets logged production data. Treat extrapolation as the enemy; use conservative objectives and strong BC baselines. Exploration and multi-agent topics next broaden the problem class; safety and monitoring close the loop on responsible deployment.
