Module 9 — Production & advanced topics

Offline RL & batch constraints

CQL/BCQ intuition, distributional shift, and learning from logged data.

~60 min read + exercises

Before we begin

Many production systems cannot run exploratory RL online: recommender systems, medical dosing, and industrial controllers must learn from historical logs collected by a legacy policy. Offline RL (also called batch RL) tries to learn a better policy from that fixed data, with no new environment interaction. Done naively, it overestimates Q for actions that never appear in the dataset, and the learned policy then picks exactly those actions in deployment.

Offline RL: train from a static dataset D = {(s, a, r, s′)} only.
Distribution shift: the learned policy queries Q at actions that are not in the data.
Conservative methods: penalize or avoid out-of-distribution (OOD) actions (CQL, IQL, BC regularization).


What you will learn

  • Contrast online, off-policy with replay, and pure offline RL.
  • Explain extrapolation error in Q-learning on fixed data.
  • Name CQL, IQL, BCQ and their core idea in one sentence each.
  • Apply behavior cloning baselines and recognize when they win.
  • Design datasets and evaluation for batch settings (OPE).

Why offline RL is hard

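Off-policy DQN learns from a replay buffer but keeps collecting new data, so when the critic overrates an action the agent tries it and the error gets corrected. Offline, the dataset is fixed. The critic is evaluated at actions π(s) that may never appear in D, and a neural Q-function tends to extrapolate optimistically there, because the max in the Bellman target picks out whichever errors are largest.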

Setting | New data? | Risk
Online RL | Yes | Safe exploration cost
Off-policy + env | Yes | Moderate
Offline only | No | Q blow-up on OOD actions

The deadly triad (function approximation + bootstrapping + off-policy learning) is worse offline: there are no corrective visits to the states and actions the critic gets wrong.
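To make the extrapolation problem concrete, here is a minimal tabular sketch. Everything in it is made up for illustration: a single state, two actions, a log that only ever contains action 0, and a parameter `q_init_unseen` that stands in for a function approximator's arbitrary guess at an action it was never trained on. It compares a Q-learning target (max over all actions) with a target that bootstraps only from the logged action.

```python
# Toy illustration of extrapolation error (all numbers are hypothetical).
# One state, two actions; the episode keeps returning to the same state.
# The log only contains action 0 with reward 1.0; action 1 never appears.

GAMMA = 0.9
dataset = [(0, 0, 1.0, 0)] * 50  # (s, a, r, s') tuples, action 0 only


def fit_q(dataset, use_max, q_init_unseen=20.0, lr=0.5, sweeps=200):
    """Tabular TD on fixed data.

    q_init_unseen stands in for a function approximator's arbitrary
    guess at an action it was never trained on.
    """
    q = {(0, 0): 0.0, (0, 1): q_init_unseen}
    for _ in range(sweeps):
        for s, a, r, s2 in dataset:
            if use_max:
                # Q-learning target: max over ALL actions, seen or not.
                boot = max(q[(s2, 0)], q[(s2, 1)])
            else:
                # In-sample target: bootstrap only from the logged action.
                boot = q[(s2, 0)]
            q[(s, a)] += lr * (r + GAMMA * boot - q[(s, a)])
    return q


q_max = fit_q(dataset, use_max=True)
q_in_sample = fit_q(dataset, use_max=False)
true_value = 1.0 / (1.0 - GAMMA)  # value of always taking action 0

print("max target:      ", q_max)
print("in-sample target:", q_in_sample)
print("true value of logged behavior:", true_value)
```

In this sketch the unseen action's value is never updated because no transition contains it, yet the max target copies its error into the value of the logged action, and the greedy policy ends up preferring the action with no data behind it. The in-sample target recovers the value of the logged behavior.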


Behavior cloning baseline

Supervised learning: π(a|s) from (s, a) pairs in logs. Works when data is expert and covers states you will see.

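python
# BC sketch, often a strong baseline; assumes policy, optimizer, and a batched dataset exist
import torch.nn.functional as F

for s, a in dataset:
    loss = F.mse_loss(policy(s), a)  # continuous actions; use F.cross_entropy for discrete
    optimizer.zero_grad()
    loss.backward()  # backprop once per batch, inside the loop
    optimizer.step()

BC strength | BC weakness
Stable, simple | Cannot improve beyond data
Good for narrow tasks | Covariate shift: errors compound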

DAgger (interactive) queries expert after visiting new states — not pure offline but bridges the gap.


Conservative Q-Learning (CQL) — intuition

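Add a penalty that pushes Q down on out-of-distribution actions while still fitting the Bellman target on dataset actions:

Loss = Bellman error + α ( E_{s∼D, a∼μ(·|s)}[Q(s,a)] − E_{(s,a)∼D}[Q(s,a)] )

Here μ is a broad proposal distribution over actions, and the second expectation is over the actions actually logged in D. The first term lowers Q wherever μ puts mass; the second raises it back on logged actions. Result: pessimistic Q away from the data, so the policy prefers actions similar to the logging policy when it is uncertain.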
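The penalty term can be sketched for discrete actions as follows. This is an illustration under assumptions, not a reference implementation: it uses the common log-sum-exp form, which corresponds to choosing μ adversarially with an entropy regularizer, and the function names (`cql_penalty`, `cql_loss`) and the two-action driving numbers are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_rows, data_actions):
    """Mean over the batch of logsumexp_a Q(s, a) - Q(s, a_data).

    q_rows[i] holds Q(s_i, a) for every discrete action a;
    data_actions[i] is the action index actually logged for s_i.
    """
    gaps = [logsumexp(row) - row[a] for row, a in zip(q_rows, data_actions)]
    return sum(gaps) / len(gaps)


def cql_loss(bellman_error, q_rows, data_actions, alpha=1.0):
    """Total loss = Bellman error + alpha * conservative penalty."""
    return bellman_error + alpha * cql_penalty(q_rows, data_actions)


# Hypothetical example: action 0 = "slow" (logged), action 1 = "floor it" (unseen).
optimistic = [[1.0, 8.0]]    # critic inflates the unseen action
pessimistic = [[1.0, -3.0]]  # critic keeps the unseen action low
print(cql_penalty(optimistic, [0]), cql_penalty(pessimistic, [0]))
```

The penalty is never negative and shrinks toward zero only when the logged action dominates the others, so minimizing it pulls unseen actions below the logged ones. In a real training loop the Q rows would come from the critic network and α would be tuned or learned.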

Worked intuition

Suppose the dataset contains only slow-driving actions. A standard Q-learner might assign a huge Q to the unseen "floor it" action. CQL suppresses that value, so the learned policy stays near the logged safe behaviors unless the data shows high reward elsewhere.

Checkpoint: When does offline RL offer no benefit over BC?

Answer

When data is uniformly expert, task is narrow, and you cannot deploy improvements outside the data support — BC matches performance with less risk. Offline RL helps when data is suboptimal but diverse enough to infer better trade-offs.


Other algorithms (one-liners)

Algorithm | Core idea
IQL | Expectile regression fits V from Q at dataset actions; never takes a max over OOD actions
BCQ | Generate actions from a model of the behavior policy and perturb them slightly; stay within data support
TD3+BC | TD3 + BC penalty toward dataset actions
Decision Transformer | Sequence modeling; condition on desired return
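As one concrete piece of the table above, the asymmetric squared loss behind IQL's expectile regression can be sketched in a few lines. The helper names, the sample Q values, and the grid search are assumptions made for illustration; a real implementation would minimize this loss by gradient descent on a value network.

```python
def expectile_loss(diffs, tau=0.7):
    """Mean of |tau - 1(u < 0)| * u^2 over residuals u = Q(s, a) - V(s)."""
    total = 0.0
    for u in diffs:
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(diffs)


def best_v(q_samples, tau, grid_steps=1000):
    """Grid-search the scalar V that minimizes the expectile loss."""
    lo, hi = min(q_samples), max(q_samples)
    candidates = [lo + (hi - lo) * i / grid_steps for i in range(grid_steps + 1)]
    return min(
        candidates,
        key=lambda v: expectile_loss([q - v for q in q_samples], tau),
    )


# Hypothetical Q values of the logged actions in one state.
q_samples = [0.0, 1.0, 2.0, 3.0, 10.0]
v_mean_like = best_v(q_samples, tau=0.5)   # symmetric loss: close to the mean
v_optimistic = best_v(q_samples, tau=0.9)  # leans toward the best logged action
print(v_mean_like, v_optimistic)
```

With tau = 0.5 the loss is symmetric and V lands near the mean of the logged Q values; raising tau moves V toward the best logged action without ever evaluating Q at an action outside the data.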

Dataset design matters

Property | Why
Coverage | Diverse (s, a), not one trajectory
Quality mix | Some good episodes to learn from
Metadata | Logging policy version, time, context
No leakage | Train/val split by time or user
Safety labels | Near-miss flags for constraints

Garbage logs → garbage policy, offline or not.
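For the "no leakage" row, one way to keep every user's transitions on a single side of the split is to hash the user id. This is a sketch under assumptions: the record schema (a dict with a `user_id` key) and the function names are hypothetical, and a time-based split would instead cut on a timestamp.

```python
import hashlib


def split_bucket(user_id, val_fraction=0.2):
    """Deterministically assign a user to 'train' or 'val' by hashing the id."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "val" if bucket < val_fraction else "train"


def split_by_user(records, val_fraction=0.2):
    """records: iterable of dicts with a 'user_id' key (hypothetical schema)."""
    train, val = [], []
    for rec in records:
        side = split_bucket(rec["user_id"], val_fraction)
        (val if side == "val" else train).append(rec)
    return train, val


# Hypothetical log: 200 users with 3 transitions each.
records = [{"user_id": u, "step": t} for u in range(200) for t in range(3)]
train, val = split_by_user(records)
train_users = {r["user_id"] for r in train}
val_users = {r["user_id"] for r in val}
print(len(train), len(val), len(train_users & val_users))
```

Because the assignment depends only on the id, the same user lands on the same side in every run and in every later data refresh.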


Offline policy evaluation (OPE)

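Before deploying, estimate the policy's value from logs:

  • Importance sampling: needs the behavior policy π_b (or logged propensities); high variance, especially over long horizons.
  • Fitted Q evaluation: learn Q for the target policy π on the data, then read off π's value from it.
  • Doubly robust: combines a learned model with IS, and stays accurate if either one is good.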

Run A/B or shadow mode when possible — OPE is approximate.
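The importance sampling estimator from the list above can be sketched as follows, in both its ordinary and weighted (self-normalized) forms. The trajectory format, the function names, and the one-step bandit data are assumptions made for illustration; `pi_e` and `pi_b` are hypothetical callables returning action probabilities for the evaluated and behavior policies.

```python
def trajectory_weight(traj, pi_e, pi_b):
    """Product over steps of pi_e(a|s) / pi_b(a|s)."""
    w = 1.0
    for s, a, _r in traj:
        w *= pi_e(s, a) / pi_b(s, a)
    return w


def discounted_return(traj, gamma):
    """Sum of gamma^t * r_t along one trajectory of (s, a, r) steps."""
    return sum((gamma ** t) * r for t, (_s, _a, r) in enumerate(traj))


def is_estimates(trajs, pi_e, pi_b, gamma=1.0):
    """Return (ordinary IS, weighted IS) estimates of pi_e's value."""
    weights = [trajectory_weight(tr, pi_e, pi_b) for tr in trajs]
    returns = [discounted_return(tr, gamma) for tr in trajs]
    weighted_sum = sum(w * g for w, g in zip(weights, returns))
    ordinary = weighted_sum / len(trajs)
    total_w = sum(weights)
    weighted = weighted_sum / total_w if total_w > 0 else 0.0
    return ordinary, weighted


# Hypothetical one-step logs: action 1 pays 1.0, action 0 pays 0.0.
def pi_b(s, a):
    return 0.5  # logging policy picked uniformly


def pi_e(s, a):
    return 0.9 if a == 1 else 0.1  # candidate policy prefers action 1


logs = [[(0, 0, 0.0)], [(0, 0, 0.0)], [(0, 1, 1.0)], [(0, 1, 1.0)]]
ordinary, weighted = is_estimates(logs, pi_e, pi_b)
print(ordinary, weighted)
```

The weights multiply across time steps, which is where the high variance on long trajectories comes from. The weighted form trades a small bias for lower variance.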


Common mistakes

Mistake | Symptom | Fix
Standard DQN offline | Q explodes | CQL / IQL / BC mix
No behavior policy info | Bad OPE | Log π_b or propensities
Train on test users | Inflated metrics | Split by user or time
Ignoring BC baseline | Complex failure | Always compare BC
Deploy without constraints | Safety incidents | Filters + human review

Closing

Offline RL is how RL meets logged production data. Treat extrapolation as the enemy; use conservative objectives and strong BC baselines. Exploration and multi-agent topics next broaden the problem class; safety and monitoring close the loop on responsible deployment.

