On-policy vs off-policy
Before we begin
On-policy methods learn about the policy being executed — including exploration. Off-policy methods learn a target policy (often greedy optimal) from data generated by a different behavior policy (often ε-greedy). The split affects sample efficiency, stability, and whether you can learn from a replay buffer or human demos.
Q-learning vs SARSA is the tabular case; DQN and SAC make the same distinction at scale.
What you will learn
- Define target policy vs behavior policy.
- Classify MC, SARSA, Q-learning, and behavior cloning.
- Explain importance sampling at a high level.
- List trade-offs: stability, exploration risk, data reuse.
- Recognize off-policy pitfalls (deadly triad preview).
Two policies
| Policy | Symbol | Role |
|---|---|---|
| Behavior | b | Generates experience (actions in env) |
| Target | π | What we want values / improvements for |
On-policy: b = π (same policy, usually stochastic for exploration).
Off-policy: b ≠ π (learn greedy while acting ε-greedy, or learn from logs).
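To make the two roles concrete, here is a minimal sketch of a greedy target policy and an ε-greedy behavior policy defined over the same Q-values. The function names, the three-action `q_row`, and ε = 0.3 are illustrative assumptions, not part of the lesson.

```python
def greedy_probs(q_row):
    """Target policy pi: all probability on the highest-valued action."""
    best = max(range(len(q_row)), key=lambda a: q_row[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_row))]


def epsilon_greedy_probs(q_row, epsilon):
    """Behavior policy b: uniform exploration with prob. epsilon, else greedy."""
    n = len(q_row)
    best = max(range(n), key=lambda a: q_row[a])
    return [epsilon / n + (1.0 - epsilon if a == best else 0.0) for a in range(n)]


# Hypothetical Q(s, .) for a single state with three actions.
q_row = [0.2, 1.0, -0.5]

pi = greedy_probs(q_row)                      # what we want to learn about
b = epsilon_greedy_probs(q_row, epsilon=0.3)  # what actually acts in the env

print("target   pi:", pi)
print("behavior b :", b)
```

Both policies come from the same Q-values, yet they assign different probabilities to actions. In the on-policy case the learner would evaluate `b` itself; in the off-policy case it evaluates `pi` while acting with `b`.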
Algorithm classification
| Method | On / off | Notes |
|---|---|---|
| SARSA | On-policy | Target includes a′ from π |
| MC control (ε-soft) | On-policy | Returns under π |
| Q-learning | Off-policy | max backup ≠ behavior |
| DQN | Off-policy | Replay from past b |
| PPO | On-policy | Fresh rollouts only |
| Behavior cloning | Off-policy | Static dataset, π = learner |
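The SARSA and Q-learning rows differ only in how the bootstrap target is built. The sketch below shows both targets side by side, assuming a dict-of-lists Q-table and a non-terminal next state; the state name `"s1"` and all numbers are made up for illustration.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action a' the agent will actually take."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy: bootstrap from the greedy action, whatever the agent does next."""
    return reward + gamma * max(q[next_state])


# Hypothetical Q-table with one state and two actions.
q = {"s1": [2.0, 6.0]}
reward, gamma = -1.0, 0.5

# Suppose the epsilon-greedy behavior policy explores and picks action 0 next.
exploratory_action = 0

on_policy = sarsa_target(q, reward, "s1", exploratory_action, gamma)
off_policy = q_learning_target(q, reward, "s1", gamma)

print("SARSA target     :", on_policy)
print("Q-learning target:", off_policy)
```

When the next action is exploratory, the SARSA target drops to reflect it, while the Q-learning target is unchanged. If the behavior policy happens to pick the greedy action, the two targets coincide.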
Why off-policy is attractive
- Sample reuse — experience replay stores transitions; train many times.
- Exploration flexibility — behave randomly, learn optimal Q*.
- Learning from logs — historical data from humans or old policies (offline RL, Module 9).
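Sample reuse is usually implemented with a replay buffer. The following is a minimal sketch under simple assumptions (fixed capacity, uniform sampling, transitions stored as plain tuples); the class name and fields are hypothetical and do not come from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Copy to a list so sampling works on a plain sequence.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    # Dummy transitions: state t, action 0, reward 1.0, next state t + 1.
    buffer.add(t, 0, 1.0, t + 1, False)

batch = buffer.sample(2)
print("stored transitions:", len(buffer))
print("sampled batch     :", batch)
```

Every transition in the buffer was generated by some earlier behavior policy, which is why only off-policy learners can train on it repeatedly.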
Importance sampling sketch
Correct for distribution mismatch:
weight = π(a|s) / b(a|s)
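Off-policy MC reweights each return by the product of these ratios along the trajectory. Multi-step off-policy TD methods correct step by step instead: Retrace uses truncated importance ratios, and Tree Backup avoids ratios by weighting with target-policy probabilities. One-step Q-learning needs no correction because its max backup already evaluates the greedy target. Variance is high when b is very different from π.
Numeric toy: with two actions, b picks uniformly (prob 0.5 each) and π is deterministic on the best action (prob 1). Each step where b happens to pick π's action contributes a factor of 1 / 0.5 = 2; any other step contributes 0. Over a T-step trajectory the weight is therefore either 2^T or 0, so a few rare trajectories carry very large weights and the estimate is noisy.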
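The toy numbers above can be checked by enumerating every trajectory b could produce. This sketch assumes two actions, a 5-step horizon, and that action 0 is the one π prefers; all three are illustrative choices, and the policies ignore state for simplicity.

```python
from itertools import product

BEST = 0  # assumed index of the action the deterministic target policy picks
T = 5     # assumed trajectory length


def pi_prob(action):
    """Target policy: probability 1 on the best action, 0 elsewhere."""
    return 1.0 if action == BEST else 0.0


def b_prob(action):
    """Behavior policy: uniform over the two actions."""
    return 0.5


def trajectory_weight(actions):
    """Product over time steps of pi(a_t | s_t) / b(a_t | s_t)."""
    weight = 1.0
    for a in actions:
        weight *= pi_prob(a) / b_prob(a)
    return weight


# Enumerate all 2^T action sequences that b can generate.
weights = [trajectory_weight(seq) for seq in product([0, 1], repeat=T)]
nonzero = [w for w in weights if w > 0]

# Each sequence has probability 0.5^T under b, so this is the expected weight.
mean_weight = sum(w * 0.5 ** T for w in weights)

print("trajectories        :", len(weights))
print("nonzero weights     :", nonzero)
print("expected weight (b) :", mean_weight)
```

The expected weight is 1, so the correction is unbiased, but all of that mass sits on a single trajectory out of 2^T. That concentration is the source of the variance.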
Cliff walking revisited
| | SARSA (on) | Q-learning (off) |
|---|---|---|
| Values reflect | Risk of ε-slipping | Optimal greedy path |
| Training path | Cautious | Near cliff |
| Deploy greedy | Safer if ε was nontrivial during train | Optimal if converged |
Stability preview
Off-policy + function approximation + bootstrapping = deadly triad (Module 3). Tabular Q-learning is safe; DQN needs replay + target nets. On-policy methods (PPO) trade sample reuse for stability.
Checkpoint: Can you use old replay data forever in PPO?
Answer
No. PPO is on-policy: stale data was collected under an old policy and no longer matches the current π. PPO collects fresh rollouts each iteration, reuses them for only a few update epochs, and then discards them.
When to prefer which
| Situation | Lean |
|---|---|
| Safety during training matters | On-policy (SARSA, PPO) |
| Max sample efficiency, replay | Off-policy (Q-learning, SAC) |
| Learning from fixed dataset | Off-policy / offline RL |
| Simple tabular grid | Either; Q-learning common |
Common mistakes
- Calling DQN “on-policy” because one ε-greedy agent collects the data: the learning target is still the greedy max over Q, not the ε-greedy behavior.
- Reusing replay without understanding non-stationary targets.
- Ignoring that off-policy MC needs importance weights or specialized TD.
- Assuming off-policy always beats on-policy on wall-clock — replay helps but instability costs tuning.
You can now place every major RL algorithm on the on/off-policy axis. Module 3 asks what happens when state spaces are too large for tables — function approximation enters, and stability becomes the central challenge.