
Module 2 — Tabular methods

On-policy vs off-policy

Behavior vs target policy, importance sampling preview, and algorithm choice.

~50 min read + exercises

Before we begin

On-policy methods learn about the policy being executed, including its exploratory actions. Off-policy methods learn a target policy (often the greedy policy) from data generated by a different behavior policy (often ε-greedy). The split affects sample efficiency, stability, and whether you can learn from a replay buffer or human demos.

Q-learning vs SARSA is the tabular case; the same distinction reappears at scale, with DQN and SAC on the off-policy side and PPO on the on-policy side.


What you will learn

  • Define target policy vs behavior policy.
  • Classify MC, SARSA, Q-learning, and behavior cloning.
  • Explain importance sampling at a high level.
  • List trade-offs: stability, exploration risk, data reuse.
  • Recognize off-policy pitfalls (deadly triad preview).

Two policies

| Policy | Symbol | Role |
|---|---|---|
| Behavior | b | Generates experience (actions in env) |
| Target | π | What we want values / improvements for |

On-policy: b = π (same policy, usually stochastic for exploration).
Off-policy: b ≠ π (learn greedy while acting ε-greedy, or learn from logs).
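
To make the two roles concrete, here is a minimal sketch that derives both policies from the same row of a Q-table. The Q-values, the function names, and ε = 0.2 are illustrative assumptions, not part of the lesson.

```python
import numpy as np

def greedy_probs(q_row):
    """Target policy π: all probability on the highest-valued action."""
    probs = np.zeros_like(q_row, dtype=float)
    probs[np.argmax(q_row)] = 1.0
    return probs

def epsilon_greedy_probs(q_row, epsilon):
    """Behavior policy b: greedy with probability 1 - ε, uniform otherwise."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

q_row = np.array([0.2, 1.5, -0.3, 0.9])        # hypothetical Q(s, ·) for one state
pi = greedy_probs(q_row)                        # target policy
b = epsilon_greedy_probs(q_row, epsilon=0.2)    # behavior policy

print("pi:", pi)
print("b: ", b)
print("b == pi (on-policy)?", np.allclose(b, pi))
```

With ε > 0 the two distributions differ, which is the off-policy case; setting ε = 0 would make b and π identical.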


Algorithm classification

| Method | On / off | Notes |
|---|---|---|
| SARSA | On-policy | Target uses a′ sampled from π |
| MC control (ε-soft) | On-policy | Returns under π |
| Q-learning | Off-policy | max backup does not depend on the action b takes next |
| DQN | Off-policy | Replay from past b |
| PPO | On-policy | Fresh rollouts only |
| Behavior cloning | Off-policy | Static dataset, π = learner |
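
The SARSA and Q-learning rows differ only in how the one-step target is formed. The sketch below shows the two targets side by side; the tiny Q-table, the reward, and γ = 0.5 are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma, done):
    """On-policy: bootstrap from the action a' the agent will actually take."""
    return r if done else r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma, done):
    """Off-policy: bootstrap from the greedy action, whatever b does next."""
    return r if done else r + gamma * np.max(Q[s_next])

Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])   # hypothetical table: 2 states, 2 actions
r, s_next, gamma = -1.0, 1, 0.5
a_next = 0                   # suppose ε-greedy explored and picked the worse action

print("SARSA target:     ", sarsa_target(Q, r, s_next, a_next, gamma, done=False))
print("Q-learning target:", q_learning_target(Q, r, s_next, gamma, done=False))
```

When the next action happens to be the greedy one, the two targets coincide; they diverge only on exploratory steps.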

Why off-policy is attractive

  1. Sample reuse — experience replay stores transitions; train many times.
  2. Exploration flexibility — behave randomly, learn optimal Q*.
  3. Learning from logs — historical data from humans or old policies (offline RL, Module 9).
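
Points 1 and 2 can be sketched together: collect transitions with a purely random behavior policy, store them in a buffer, then run tabular Q-learning on sampled minibatches. The three-state chain environment, the `step` helper, and all hyperparameters below are hypothetical and chosen only to keep the example small.

```python
import random
from collections import deque

import numpy as np

rng = random.Random(0)
buffer = deque(maxlen=1000)   # oldest transitions are evicted first

def step(s, a):
    """Hypothetical chain: action 1 moves right, action 0 stays; state 2 is terminal, reward +1."""
    s_next = min(s + a, 2)
    done = s_next == 2
    return s_next, (1.0 if done else 0.0), done

# Behavior policy b: uniformly random actions, no learning while acting.
for _ in range(200):
    s, done = 0, False
    while not done:
        a = rng.randrange(2)
        s_next, r, done = step(s, a)
        buffer.append((s, a, r, s_next, done))
        s = s_next

# Learn the greedy target policy purely from stored data; each transition is reused many times.
Q = np.zeros((3, 2))
alpha, gamma = 0.1, 0.9
for _ in range(500):
    for s, a, r, s_next, done in rng.sample(list(buffer), 32):
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

print("greedy actions in states 0 and 1:", Q[:2].argmax(axis=1))
```

The agent never acts greedily during data collection, yet the learned Q-table prefers moving right in both non-terminal states.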

Importance sampling sketch

Correct for distribution mismatch:

weight = π(a|s) / b(a|s)

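Off-policy MC reweights whole returns by the product of these ratios along the trajectory. Multi-step off-policy TD applies per-step corrections instead: Retrace truncates each ratio, and Tree Backup avoids ratios altogether by backing up expectations under π. One-step Q-learning needs no weights because its max backup does not depend on the action b takes next. Variance is high when b is very different from π.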
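
For reference, the per-step weight extends to a whole trajectory as a product. The symbols ρ, G_t (the return from time t), and T (the episode's final time step) follow common textbook notation and are assumptions here, since the lesson does not define them. The identity requires that b(a|s) > 0 wherever π(a|s) > 0.

```latex
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},
\qquad
v_\pi(s) = \mathbb{E}_b\!\left[\, \rho_{t:T-1}\, G_t \;\middle|\; S_t = s \,\right]
```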

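Numeric toy: b picks between two actions uniformly (prob 0.5 each); π is deterministic on the best action (prob 1). Each step where b happens to pick the best action has weight 1 / 0.5 = 2; any other step has weight 0. Over a T-step episode the product is 2^T with probability 0.5^T and 0 otherwise, so a few rare episodes carry all the weight and the estimate is noisy.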
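
The toy can be simulated directly. This sketch assumes two actions with action 0 as the "best" one, and looks at how the trajectory weight behaves as episodes get longer; the helper name `is_weights`, the episode counts, and the seed are arbitrary choices.

```python
import numpy as np

def is_weights(actions, best_action=0, b_prob=0.5):
    """Per-episode importance weights for a deterministic target policy.

    actions: int array of shape (episodes, T) sampled from the uniform behavior policy b.
    Each step contributes π(a|s)/b(a|s): 1/0.5 = 2 if a is the best action, else 0.
    """
    per_step = np.where(actions == best_action, 1.0, 0.0) / b_prob
    return per_step.prod(axis=1)

rng = np.random.default_rng(0)
stats = {}
for T in (1, 5, 10):
    actions = rng.integers(0, 2, size=(100_000, T))   # b: two actions, prob 0.5 each
    w = is_weights(actions)
    stats[T] = (w.mean(), w.var(), w.max(), np.mean(w > 0))
    print(f"T={T:2d}  mean={stats[T][0]:.3f}  var={stats[T][1]:.1f}  "
          f"max={stats[T][2]:.0f}  nonzero fraction={stats[T][3]:.4f}")
```

The mean weight stays near 1 (the estimator is unbiased), while the variance and the largest weight grow quickly with T and the fraction of episodes that contribute anything shrinks.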


Cliff walking revisited

| | SARSA (on) | Q-learning (off) |
|---|---|---|
| Values reflect | Risk of ε-slipping | Optimal greedy path |
| Training path | Cautious | Near cliff |
| Deploy greedy | Safer if ε was nontrivial during training | Optimal if converged |
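
The "Values reflect" row can be illustrated with one state next to the cliff. The sketch compares the greedy backup that Q-learning uses with the expected value under ε-greedy behavior, which is what SARSA's sampled targets average to. The Q-values and action ordering are invented for illustration.

```python
import numpy as np

def greedy_value(q_row):
    """Value of a state if the agent always acts greedily (Q-learning's backup)."""
    return float(np.max(q_row))

def epsilon_greedy_value(q_row, epsilon):
    """Expected value under ε-greedy: greedy with prob 1 - ε, uniform over all actions with prob ε."""
    return float((1 - epsilon) * np.max(q_row) + epsilon * np.mean(q_row))

# Hypothetical Q(s, ·) for a cell on the cliff edge: [up, right, down, left].
# "down" steps off the cliff; the other values are made up.
q_edge = np.array([-12.0, -10.0, -100.0, -12.0])

print("greedy backup:        ", greedy_value(q_edge))
print("ε-greedy value, ε=0.1:", epsilon_greedy_value(q_edge, 0.1))   # about -12.35
```

Under ε-greedy behavior the edge cell looks noticeably worse than its greedy value suggests, which is why SARSA drifts toward a path farther from the cliff while Q-learning keeps valuing the edge path as if no slips occur.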

Stability preview

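Off-policy learning + function approximation + bootstrapping = the deadly triad (Module 3). Tabular Q-learning avoids the triad because it has no function approximation, and it converges under the standard conditions (every state-action pair visited infinitely often, suitably decaying step sizes). DQN combines all three, so it relies on replay and target networks to stay stable. On-policy methods such as PPO give up sample reuse in exchange for stability.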

Checkpoint: Can you use old replay data forever in PPO?

Answer

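No. PPO is on-policy: stale data was collected under an old policy and no longer matches the current π. PPO reuses each batch for only a few epochs (its clipped objective limits how far π moves from the data-collecting policy), then discards it and gathers fresh rollouts.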


When to prefer which

| Situation | Lean |
|---|---|
| Safety during training matters | On-policy (SARSA, PPO) |
| Max sample efficiency, replay | Off-policy (Q-learning, SAC) |
| Learning from fixed dataset | Off-policy / offline RL |
| Simple tabular grid | Either; Q-learning common |

Common mistakes

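  • Calling DQN “on-policy” because one ε-greedy agent collects the data: the target policy is still greedy with respect to Q.
  • Reusing replay without accounting for non-stationary targets: the bootstrapped target changes every time Q is updated.
  • Forgetting that off-policy MC needs importance weights, and multi-step off-policy TD needs weights or a specialized backup.
  • Assuming off-policy always beats on-policy on wall-clock time: replay improves sample efficiency, but instability costs tuning effort.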

You can now place every major RL algorithm on the on/off-policy axis. Module 3 asks what happens when state spaces are too large for tables — function approximation enters, and stability becomes the central challenge.

