Q-learning & SARSA

Before we begin

SARSA (on-policy) learns Q^π for the policy you follow, including exploration. Q-learning (off-policy) learns Q* toward the greedy optimal policy while behaving ε-greedy. Both are one-line changes to the TD backup — but that line changes safety and convergence stories.

Mastering the update rules prevents mixing them up in implementations and interviews.

What you will learn

Write SARSA and Q-learning update equations.
Implement tabular Q-learning on a small grid.
Explain why Q-learning uses max over next actions.
Compare behavior on cliff walking (safe vs shortest path).
Choose learning rate α and exploration ε schedules.

SARSA (on-policy TD control)

After taking a in s, seeing r, s′, and choosing a′ per policy:

Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]

Name: State Action Reward State Action.

python

def sarsa_step(Q, s, a, r, s_next, a_next, gamma, alpha):
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

Q-learning (off-policy TD control)

Same transition, but target uses best next action:

Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]

python

import numpy as np
 
def q_learning_step(Q, s, a, r, s_next, gamma, alpha, terminal=False):
    best_next = 0 if terminal else np.max(Q[s_next])
    target = r + gamma * best_next
    Q[s, a] += alpha * (target - Q[s, a])

Algorithm	Bootstraps from
SARSA	Q(s′, a′) actually taken
Q-learning	max_a′ Q(s′, a′)

Cliff walking contrast

Grid: shortest path skims a cliff (large negative reward if stepped off). ε-greedy exploration.

Algorithm	Typical path	Why
Q-learning	Along cliff edge	Learns optimal Q*; risky steps during explore hurt less in value of optimal policy
SARSA	Safer route inland	Updates on actual risky a′ — learns π that accounts for ε-slipping

Checkpoint: You deploy with ε=0 (greedy only). Which algorithm’s Q table matches deployment?

Answer

Q-learning targets optimal greedy policy. SARSA Q reflects ε-soft behavior during training — with ε→0 and enough visits, both can converge to optimal on benign MDPs, but paths during learning differ.

ε-greedy behavior policy

python

def epsilon_greedy(Q, s, epsilon, n_actions, rng):
    if rng.random() < epsilon:
        return rng.integers(n_actions)
    return int(np.argmax(Q[s]))

Common schedule: ε starts 1.0 → decays to 0.05 over first 10k steps.

Convergence conditions (tabular)

Q-learning converges to Q* with:

All (s, a) visited infinitely often
α satisfies Robbins–Monro: Σα = ∞, Σα² < ∞ (or fixed small α in practice)
Standard MDP conditions

SARSA converges to Q^π for the behavior policy π (ε-soft).

Hyperparameter table

Knob	Too low	Too high
α	Slow learning	Oscillation / divergence
ε	Stuck in local policy	Random walk, ignores learning
γ	Myopic	Slow credit to distant goals

Common mistakes

Using max in SARSA (that is Q-learning).
Forgetting to set Q(terminal, ·) = 0 or skip max at terminal s′.
Argmax tie-breaking arbitrary — multiple optimal actions OK.
One update per episode — need many passes over transitions.

Q-learning is the direct ancestor of DQN. First understand on-policy vs off-policy deeply — the distinction drives safety, data reuse, and algorithm design.

Before this lesson

Previous lesson

What's next

Next lesson — On-policy vs off-policy