← Back to curriculum

Module 2 — Tabular methods

Q-learning & SARSA

Off-policy Q-learning vs on-policy SARSA, update rules, and convergence intuition.

~70 min read + exercises

Q-learning & SARSA

Before we begin

SARSA (on-policy) learns Q^π for the policy you follow, including exploration. Q-learning (off-policy) learns Q* toward the greedy optimal policy while behaving ε-greedy. Both are one-line changes to the TD backup — but that line changes safety and convergence stories.

Mastering the update rules prevents mixing them up in implementations and interviews.


What you will learn

  • Write SARSA and Q-learning update equations.
  • Implement tabular Q-learning on a small grid.
  • Explain why Q-learning uses max over next actions.
  • Compare behavior on cliff walking (safe vs shortest path).
  • Choose learning rate α and exploration ε schedules.

SARSA (on-policy TD control)

After taking a in s, seeing r, s′, and choosing a′ per policy:

Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]

Name: State Action Reward State Action.

python
def sarsa_step(Q, s, a, r, s_next, a_next, gamma, alpha):
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

Q-learning (off-policy TD control)

Same transition, but target uses best next action:

Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]

python
import numpy as np
 
def q_learning_step(Q, s, a, r, s_next, gamma, alpha, terminal=False):
    best_next = 0 if terminal else np.max(Q[s_next])
    target = r + gamma * best_next
    Q[s, a] += alpha * (target - Q[s, a])
AlgorithmBootstraps from
SARSAQ(s′, a′) actually taken
Q-learningmax_a′ Q(s′, a′)

Cliff walking contrast

Grid: shortest path skims a cliff (large negative reward if stepped off). ε-greedy exploration.

AlgorithmTypical pathWhy
Q-learningAlong cliff edgeLearns optimal Q*; risky steps during explore hurt less in value of optimal policy
SARSASafer route inlandUpdates on actual risky a′ — learns π that accounts for ε-slipping

Checkpoint: You deploy with ε=0 (greedy only). Which algorithm’s Q table matches deployment?

Answer

Q-learning targets optimal greedy policy. SARSA Q reflects ε-soft behavior during training — with ε→0 and enough visits, both can converge to optimal on benign MDPs, but paths during learning differ.


ε-greedy behavior policy

python
def epsilon_greedy(Q, s, epsilon, n_actions, rng):
    if rng.random() < epsilon:
        return rng.integers(n_actions)
    return int(np.argmax(Q[s]))

Common schedule: ε starts 1.0 → decays to 0.05 over first 10k steps.


Convergence conditions (tabular)

Q-learning converges to Q* with:

  • All (s, a) visited infinitely often
  • α satisfies Robbins–Monro: Σα = ∞, Σα² < ∞ (or fixed small α in practice)
  • Standard MDP conditions

SARSA converges to Q^π for the behavior policy π (ε-soft).


Hyperparameter table

KnobToo lowToo high
αSlow learningOscillation / divergence
εStuck in local policyRandom walk, ignores learning
γMyopicSlow credit to distant goals

Common mistakes

  • Using max in SARSA (that is Q-learning).
  • Forgetting to set Q(terminal, ·) = 0 or skip max at terminal s′.
  • Argmax tie-breaking arbitrary — multiple optimal actions OK.
  • One update per episode — need many passes over transitions.

Q-learning is the direct ancestor of DQN. First understand on-policy vs off-policy deeply — the distinction drives safety, data reuse, and algorithm design.


Before this lesson


What's next