Q-learning & SARSA
Before we begin
SARSA (on-policy) learns Q^π for the policy you follow, including exploration. Q-learning (off-policy) learns Q* toward the greedy optimal policy while behaving ε-greedy. Both are one-line changes to the TD backup — but that line changes safety and convergence stories.
Mastering the update rules prevents mixing them up in implementations and interviews.
What you will learn
- Write SARSA and Q-learning update equations.
- Implement tabular Q-learning on a small grid.
- Explain why Q-learning uses max over next actions.
- Compare behavior on cliff walking (safe vs shortest path).
- Choose learning rate α and exploration ε schedules.
SARSA (on-policy TD control)
After taking a in s, seeing r, s′, and choosing a′ per policy:
Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]
Name: State Action Reward State Action.
def sarsa_step(Q, s, a, r, s_next, a_next, gamma, alpha):
target = r + gamma * Q[s_next, a_next]
Q[s, a] += alpha * (target - Q[s, a])Q-learning (off-policy TD control)
Same transition, but target uses best next action:
Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
import numpy as np
def q_learning_step(Q, s, a, r, s_next, gamma, alpha, terminal=False):
best_next = 0 if terminal else np.max(Q[s_next])
target = r + gamma * best_next
Q[s, a] += alpha * (target - Q[s, a])| Algorithm | Bootstraps from |
|---|---|
| SARSA | Q(s′, a′) actually taken |
| Q-learning | max_a′ Q(s′, a′) |
Cliff walking contrast
Grid: shortest path skims a cliff (large negative reward if stepped off). ε-greedy exploration.
| Algorithm | Typical path | Why |
|---|---|---|
| Q-learning | Along cliff edge | Learns optimal Q*; risky steps during explore hurt less in value of optimal policy |
| SARSA | Safer route inland | Updates on actual risky a′ — learns π that accounts for ε-slipping |
Checkpoint: You deploy with ε=0 (greedy only). Which algorithm’s Q table matches deployment?
Answer
Q-learning targets optimal greedy policy. SARSA Q reflects ε-soft behavior during training — with ε→0 and enough visits, both can converge to optimal on benign MDPs, but paths during learning differ.
ε-greedy behavior policy
def epsilon_greedy(Q, s, epsilon, n_actions, rng):
if rng.random() < epsilon:
return rng.integers(n_actions)
return int(np.argmax(Q[s]))Common schedule: ε starts 1.0 → decays to 0.05 over first 10k steps.
Convergence conditions (tabular)
Q-learning converges to Q* with:
- All (s, a) visited infinitely often
- α satisfies Robbins–Monro: Σα = ∞, Σα² < ∞ (or fixed small α in practice)
- Standard MDP conditions
SARSA converges to Q^π for the behavior policy π (ε-soft).
Hyperparameter table
| Knob | Too low | Too high |
|---|---|---|
| α | Slow learning | Oscillation / divergence |
| ε | Stuck in local policy | Random walk, ignores learning |
| γ | Myopic | Slow credit to distant goals |
Common mistakes
- Using max in SARSA (that is Q-learning).
- Forgetting to set Q(terminal, ·) = 0 or skip max at terminal s′.
- Argmax tie-breaking arbitrary — multiple optimal actions OK.
- One update per episode — need many passes over transitions.
Q-learning is the direct ancestor of DQN. First understand on-policy vs off-policy deeply — the distinction drives safety, data reuse, and algorithm design.