On-policy vs off-policy

Before we begin

On-policy methods learn about the policy being executed, exploration included. Off-policy methods learn about a target policy (often the greedy one) from data generated by a different behavior policy (often ε-greedy). The split affects sample efficiency, stability, and whether you can learn from a replay buffer or human demonstrations.

Q-learning vs SARSA is the tabular case; at scale, the same distinction separates DQN and SAC (off-policy) from PPO (on-policy).

What you will learn

Define target policy vs behavior policy.
Classify MC, SARSA, Q-learning, and behavior cloning.
Explain importance sampling at a high level.
List trade-offs: stability, exploration risk, data reuse.
Recognize off-policy pitfalls (deadly triad preview).

Two policies

Policy	Symbol	Role
Behavior	b	Generates experience (actions in env)
Target	π	What we want values / improvements for

On-policy: b = π (same policy, usually stochastic for exploration).
Off-policy: b ≠ π (learn greedy while acting ε-greedy, or learn from logs).
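
To make the two roles concrete, here is a minimal Python sketch that derives both policies from one row of a Q-table. The function names, the three-action state, and the ε value are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def greedy_probs(q_row):
    """Target policy pi: all probability mass on the argmax action."""
    probs = np.zeros(len(q_row))
    probs[np.argmax(q_row)] = 1.0
    return probs

def epsilon_greedy_probs(q_row, epsilon):
    """Behavior policy b: uniform with probability epsilon, greedy otherwise."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

q_row = np.array([0.1, 0.7, 0.3])   # hypothetical action values for one state
pi = greedy_probs(q_row)
b = epsilon_greedy_probs(q_row, epsilon=0.3)
print("target pi:  ", pi)
print("behavior b: ", b)
```

Acting from `b` while learning about `pi` is the off-policy setting; acting from `b` and learning about `b` itself is the on-policy one.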

Algorithm classification

Method	On / off	Notes
SARSA	On-policy	Target includes a′ from π
MC control (ε-soft)	On-policy	Returns under π
Q-learning	Off-policy	max backup is greedy, whatever b does next
DQN	Off-policy	Replay from past b
PPO	On-policy	Fresh rollouts only
Behavior cloning	Off-policy	Static dataset from a demonstrator; π = learner
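
The SARSA and Q-learning rows differ only in how the bootstrap target is formed. The sketch below shows the two targets side by side; the 2×2 Q-table, the reward, and the discount are made-up numbers chosen for easy arithmetic.

```python
import numpy as np

def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the agent will actually take."""
    return reward + gamma * q[next_state, next_action]

def q_learning_target(q, reward, next_state, gamma):
    """Off-policy: bootstrap from the greedy action, whatever is taken next."""
    return reward + gamma * np.max(q[next_state])

# Hypothetical Q-table: rows are states, columns are actions.
q = np.array([[0.0, 0.0],
              [1.0, 5.0]])
reward, next_state, gamma = 1.0, 1, 0.9
explored_action = 0   # the behavior policy happened to pick the worse action

print(sarsa_target(q, reward, next_state, explored_action, gamma))   # 1.9
print(q_learning_target(q, reward, next_state, gamma))               # 5.5
```

When the next action is exploratory, SARSA's target absorbs its lower value, while Q-learning's target ignores it.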

Why off-policy is attractive

Sample reuse — experience replay stores transitions; train many times.
Exploration flexibility — behave randomly, learn optimal Q*.
Learning from logs — historical data from humans or old policies (offline RL, Module 9).
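
Sample reuse is usually implemented with a replay buffer. The following is a minimal sketch using only the Python standard library; the class name `ReplayBuffer`, its methods, and the tiny capacity are assumptions for illustration, not the API of any specific framework.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of transitions (hypothetical helper)."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)   # oldest entries fall off first

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng):
        # Uniform sampling without replacement; the sampled transitions may
        # have been generated by many different past behavior policies.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

rng = random.Random(0)
buffer = ReplayBuffer(capacity=3)
for t in range(5):   # push 5 transitions into a buffer that holds 3
    buffer.add(state=t, action=0, reward=-1.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=2, rng=rng)
print(len(buffer))                          # 3
print([tr[0] for tr in buffer.storage])     # [2, 3, 4]
```

Because the stored transitions were produced by earlier policies, training on them only makes sense for a method whose update does not assume b = π.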

Importance sampling sketch

Correct for distribution mismatch:

weight = π(a|s) / b(a|s)

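MC off-policy reweights each return by the product of per-step ratios along the trajectory. Multi-step TD methods use per-decision corrections instead: Retrace truncates the ratios, and Tree Backup avoids them by taking expectations under π. The ratio requires b(a|s) > 0 wherever π(a|s) > 0, and variance is high when b is very different from π.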

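Numeric toy: b picks one of two actions uniformly (prob 0.5 each); π is deterministic on the best action (prob 1). A step where b happens to pick π's action has ratio 1 / 0.5 = 2; any other step has ratio 0. Over a T-step trajectory the product is 2^T with probability 0.5^T and 0 otherwise, so a few rare trajectories carry all the weight and the estimate is noisy.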
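
The toy can be checked by brute force. This sketch enumerates every action sequence b could produce and computes the exact mean and variance of the trajectory weight; it assumes state-independent policies and that action 0 is the one π prefers, both simplifications made for the example.

```python
from itertools import product

pi_probs = [1.0, 0.0]   # target: always action 0 (assumed best action)
b_probs = [0.5, 0.5]    # behavior: uniform over two actions

def trajectory_weight(actions):
    """Product of per-step ratios pi(a|s) / b(a|s) along one trajectory."""
    weight = 1.0
    for a in actions:
        weight *= pi_probs[a] / b_probs[a]
    return weight

def weight_mean_and_variance(horizon):
    """Exact mean and variance of the weight under b, by enumeration."""
    mean, second_moment = 0.0, 0.0
    for actions in product(range(2), repeat=horizon):
        prob = 1.0
        for a in actions:
            prob *= b_probs[a]
        w = trajectory_weight(actions)
        mean += prob * w
        second_moment += prob * w * w
    return mean, second_moment - mean ** 2

print(trajectory_weight([0, 0, 0]))    # 8.0
print(trajectory_weight([0, 1, 0]))    # 0.0
print(weight_mean_and_variance(3))     # (1.0, 7.0)
print(weight_mean_and_variance(10))    # (1.0, 1023.0)
```

The mean stays at 1, so the correction is unbiased, but the variance grows as 2^T - 1 with the horizon T.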

Cliff walking revisited

	SARSA (on)	Q-learning (off)
Values reflect	ε-greedy behavior, including the risk of an exploratory slip	Greedy (optimal) path
Training path	Cautious, away from the edge	Along the cliff edge
Deploy greedy	Safer if ε was non-negligible during training	Optimal if converged
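
The comparison can be reproduced with a small tabular experiment. The sketch below assumes the usual 4×12 cliff-walking layout (start bottom-left, goal bottom-right, cliff between them, reward -1 per step and -100 for falling); the hyperparameters, seed, and function names are illustrative choices, and results will vary with them.

```python
import numpy as np

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """One environment step: returns (next_state, reward, done)."""
    row = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    col = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if row == 3 and 1 <= col <= 10:          # fell off the cliff
        return START, -100.0, False
    return (row, col), -1.0, (row, col) == GOAL

def choose(q, state, epsilon, rng):
    """Epsilon-greedy behavior policy, used by both methods."""
    if rng.random() < epsilon:
        return int(rng.integers(len(MOVES)))
    return int(np.argmax(q[state]))

def train(method, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((ROWS, COLS, len(MOVES)))
    for _ in range(episodes):
        state, done = START, False
        action = choose(q, state, epsilon, rng)
        while not done:
            next_state, reward, done = step(state, action)
            next_action = choose(q, next_state, epsilon, rng)
            if method == "sarsa":            # bootstrap from the action taken
                bootstrap = q[next_state][next_action]
            else:                            # Q-learning: bootstrap from the max
                bootstrap = np.max(q[next_state])
            target = reward + gamma * bootstrap * (not done)
            q[state][action] += alpha * (target - q[state][action])
            state, action = next_state, next_action
    return q

def greedy_path(q, max_steps=100):
    """Follow the greedy policy from START; stop at GOAL or after max_steps."""
    state, path = START, [START]
    for _ in range(max_steps):
        state, reward, done = step(state, int(np.argmax(q[state])))
        path.append(state)
        if done:
            break
    return path

q_sarsa = train("sarsa")
q_qlearn = train("q_learning")
print("SARSA greedy steps:     ", len(greedy_path(q_sarsa)) - 1)
print("Q-learning greedy steps:", len(greedy_path(q_qlearn)) - 1)
```

The shortest safe route takes 13 steps along the cliff edge. Q-learning tends to find it, while SARSA tends to settle on a longer detour, though a single seed is not proof of either.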

Stability preview

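Off-policy + function approximation + bootstrapping = deadly triad (Module 3). Tabular Q-learning lacks the function-approximation leg and converges under standard conditions; DQN has all three and relies on replay and target networks to stay stable. On-policy methods (PPO) trade sample reuse for stability.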
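
As a preview of the target-network idea, the sketch below computes TD targets from a frozen copy of the weights. It assumes a linear Q-function (features times a weight matrix) purely to avoid a deep-learning dependency; the names and numbers are hypothetical, and this is not how any particular DQN implementation is written.

```python
import numpy as np

def td_targets(target_weights, rewards, next_features, dones, gamma):
    """r + gamma * max_a Q_target(s', a), with no bootstrap at terminal states."""
    next_q = next_features @ target_weights        # shape: (batch, n_actions)
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

online_weights = np.array([[1.0, 2.0],
                           [3.0, 4.0]])            # (n_features, n_actions)
target_weights = online_weights.copy()             # frozen snapshot

rewards = np.array([1.0, 1.0])
next_features = np.array([[1.0, 0.0],
                          [0.0, 1.0]])
dones = np.array([0.0, 1.0])                       # second transition is terminal

before = td_targets(target_weights, rewards, next_features, dones, gamma=0.5)
online_weights += 10.0                             # stand-in for gradient updates
after = td_targets(target_weights, rewards, next_features, dones, gamma=0.5)
print(before, after)                               # identical until a sync

target_weights = online_weights.copy()             # periodic hard sync
synced = td_targets(target_weights, rewards, next_features, dones, gamma=0.5)
print(synced)
```

Holding the targets fixed between syncs keeps the regression target from moving with every update, which addresses the bootstrapping leg of the triad.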

Checkpoint: Can you use old replay data forever in PPO?

Answer

No. PPO is on-policy: stale data was collected under an old policy and does not match the current π. PPO reuses each batch for a few gradient epochs, then discards it and collects fresh rollouts.

When to prefer which

Situation	Lean
Safety during training matters	On-policy (SARSA, PPO)
Max sample efficiency, replay	Off-policy (Q-learning, SAC)
Learning from fixed dataset	Off-policy / offline RL
Simple tabular grid	Either; Q-learning common

Common mistakes

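Calling DQN “on-policy” because one ε-greedy agent collects the data: the target policy is still greedy with respect to Q, and replayed transitions come from older behavior policies.
Reusing replay data without accounting for non-stationary targets (the bootstrapped target moves as Q is updated).
Forgetting that off-policy MC needs importance weights, and that multi-step off-policy TD needs a specialized correction.
Assuming off-policy always beats on-policy on wall-clock time: replay improves sample efficiency, but instability costs tuning effort.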

You can now place every major RL algorithm on the on/off-policy axis. Module 3 asks what happens when state spaces are too large for tables — function approximation enters, and stability becomes the central challenge.
