← Back to curriculum

Module 2 — Tabular methods

Monte Carlo methods

First-visit and every-visit MC, episodic returns, and MC control with ε-soft policies.

~65 min read + exercises

Monte Carlo methods

Before we begin

Monte Carlo (MC) methods learn value functions from complete episodes: wait until termination, compute actual return Gₜ, and average returns seen from each state or state–action pair. No bootstrapping, no model — just experience.

MC is high-variance but unbiased for V^π under the policy that generated the data. It clarifies what “learning from episodes” means before we add TD bootstrapping.


What you will learn

  • Estimate V^π(s) and Q^π(s, a) from episode samples.
  • Distinguish first-visit vs every-visit MC.
  • Implement MC policy evaluation for a random policy.
  • Understand exploring starts and why coverage matters.
  • Compare MC variance to TD methods preview.

MC policy evaluation

For each episode under policy π:

  1. Record visits to states (and actions for Q).
  2. For each visit at time t, compute Gₜ from t to episode end.
  3. Average all Gₜ for that state.

First-visit MC: only count the first time s appears in an episode.
Every-visit MC: count all timesteps where Sₜ = s.

VariantBiasTypical use
First-visitUnbiasedTextbook default
Every-visitSlight bias earlyOften similar asymptotically

Worked example

Episode rewards from t=0: (0, 0, 1), γ = 1, states (A, B, B).

  • From first visit A at t=0: G₀ = 0+0+1 = 1
  • From first visit B at t=1: G₁ = 0+1 = 1
  • Every-visit B also gets G₂ = 1 at t=2

If this is the only episode, first-visit V(A)=1, V(B)=1.

python
def first_visit_mc(episodes, gamma=1.0):
  """episodes: list of (states, rewards) per episode."""
  returns_sum = {}
  returns_count = {}
  for states, rewards in episodes:
      G = 0
      visited = set()
      T = len(rewards)
      for t in reversed(range(T)):
          G = rewards[t] + gamma * G
          s = states[t]
          if s not in visited:
              visited.add(s)
              returns_sum[s] = returns_sum.get(s, 0) + G
              returns_count[s] = returns_count.get(s, 0) + 1
  return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

MC for Q^π and control

Store returns per (s, a) pair when action a was taken at s. MC control improves policy toward greedy Q, but need exploring starts (every s, a possible at episode start) or ε-soft policies so every pair is visited infinitely often.

Checkpoint: Why can MC not update mid-episode in continuing tasks without modifications?

Answer

Return Gₜ is defined to episode end — no natural termination in continuing problems. Fixes: use discounted returns over long windows, temporal-difference methods, or average-reward formulation.


On-policy MC control sketch

  1. Generate episode with ε-soft π.
  2. For each (s, a) in episode, Gₜ ← return from first visit.
  3. Q(s, a) ← average of returns.
  4. π(s) ← ε-greedy with respect to Q.

Variance and sample efficiency

PropertyMC
BiasNone (for V^π of generating π)
VarianceHigh — full trajectory noise
Update timingEnd of episode only
BootstrappingNo

Long episodes dilute credit — reward at step 500 affects G₀ strongly even if early actions mattered more.


Common mistakes

  • Averaging returns without enough visits — V(s) stays noisy.
  • Evaluating π while data comes from a different behavior policy without importance sampling (off-policy MC is advanced).
  • Forgetting to process episodes backward when computing Gₜ efficiently.
  • Using MC on non-episodic tasks without defining episode boundaries.

Monte Carlo uses full returns; temporal-difference learning updates after every step by bootstrapping — lower variance, introducing slight bias, and the foundation of Q-learning.


Before this lesson


What's next