Monte Carlo methods

Before we begin

Monte Carlo (MC) methods learn value functions from complete episodes: wait until termination, compute actual return Gₜ, and average returns seen from each state or state–action pair. No bootstrapping, no model — just experience.

MC is high-variance but unbiased for V^π under the policy that generated the data. It clarifies what “learning from episodes” means before we add TD bootstrapping.

What you will learn

Estimate V^π(s) and Q^π(s, a) from episode samples.
Distinguish first-visit vs every-visit MC.
Implement MC policy evaluation for a random policy.
Understand exploring starts and why coverage matters.
Compare MC variance to TD methods preview.

MC policy evaluation

For each episode under policy π:

Record visits to states (and actions for Q).
For each visit at time t, compute Gₜ from t to episode end.
Average all Gₜ for that state.

First-visit MC: only count the first time s appears in an episode.
Every-visit MC: count all timesteps where Sₜ = s.

Variant	Bias	Typical use
First-visit	Unbiased	Textbook default
Every-visit	Slight bias early	Often similar asymptotically

Worked example

Episode rewards from t=0: (0, 0, 1), γ = 1, states (A, B, B).

From first visit A at t=0: G₀ = 0+0+1 = 1
From first visit B at t=1: G₁ = 0+1 = 1
Every-visit B also gets G₂ = 1 at t=2

If this is the only episode, first-visit V(A)=1, V(B)=1.

python

def first_visit_mc(episodes, gamma=1.0):
  """episodes: list of (states, rewards) per episode."""
  returns_sum = {}
  returns_count = {}
  for states, rewards in episodes:
      G = 0
      visited = set()
      T = len(rewards)
      for t in reversed(range(T)):
          G = rewards[t] + gamma * G
          s = states[t]
          if s not in visited:
              visited.add(s)
              returns_sum[s] = returns_sum.get(s, 0) + G
              returns_count[s] = returns_count.get(s, 0) + 1
  return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

MC for Q^π and control

Store returns per (s, a) pair when action a was taken at s. MC control improves policy toward greedy Q, but need exploring starts (every s, a possible at episode start) or ε-soft policies so every pair is visited infinitely often.

Checkpoint: Why can MC not update mid-episode in continuing tasks without modifications?

Answer

Return Gₜ is defined to episode end — no natural termination in continuing problems. Fixes: use discounted returns over long windows, temporal-difference methods, or average-reward formulation.

On-policy MC control sketch

Generate episode with ε-soft π.
For each (s, a) in episode, Gₜ ← return from first visit.
Q(s, a) ← average of returns.
π(s) ← ε-greedy with respect to Q.

Variance and sample efficiency

Property	MC
Bias	None (for V^π of generating π)
Variance	High — full trajectory noise
Update timing	End of episode only
Bootstrapping	No

Long episodes dilute credit — reward at step 500 affects G₀ strongly even if early actions mattered more.

Common mistakes

Averaging returns without enough visits — V(s) stays noisy.
Evaluating π while data comes from a different behavior policy without importance sampling (off-policy MC is advanced).
Forgetting to process episodes backward when computing Gₜ efficiently.
Using MC on non-episodic tasks without defining episode boundaries.

Monte Carlo uses full returns; temporal-difference learning updates after every step by bootstrapping — lower variance, introducing slight bias, and the foundation of Q-learning.

Before this lesson

Previous lesson

What's next

Next lesson — Temporal-difference learning