Monte Carlo methods
Before we begin
Monte Carlo (MC) methods learn value functions from complete episodes: wait until termination, compute actual return Gₜ, and average returns seen from each state or state–action pair. No bootstrapping, no model — just experience.
MC is high-variance but unbiased for V^π under the policy that generated the data. It clarifies what “learning from episodes” means before we add TD bootstrapping.
What you will learn
- Estimate V^π(s) and Q^π(s, a) from episode samples.
- Distinguish first-visit vs every-visit MC.
- Implement MC policy evaluation for a random policy.
- Understand exploring starts and why coverage matters.
- Compare MC variance to TD methods preview.
MC policy evaluation
For each episode under policy π:
- Record visits to states (and actions for Q).
- For each visit at time t, compute Gₜ from t to episode end.
- Average all Gₜ for that state.
First-visit MC: only count the first time s appears in an episode.
Every-visit MC: count all timesteps where Sₜ = s.
| Variant | Bias | Typical use |
|---|---|---|
| First-visit | Unbiased | Textbook default |
| Every-visit | Slight bias early | Often similar asymptotically |
Worked example
Episode rewards from t=0: (0, 0, 1), γ = 1, states (A, B, B).
- From first visit A at t=0: G₀ = 0+0+1 = 1
- From first visit B at t=1: G₁ = 0+1 = 1
- Every-visit B also gets G₂ = 1 at t=2
If this is the only episode, first-visit V(A)=1, V(B)=1.
def first_visit_mc(episodes, gamma=1.0):
"""episodes: list of (states, rewards) per episode."""
returns_sum = {}
returns_count = {}
for states, rewards in episodes:
G = 0
visited = set()
T = len(rewards)
for t in reversed(range(T)):
G = rewards[t] + gamma * G
s = states[t]
if s not in visited:
visited.add(s)
returns_sum[s] = returns_sum.get(s, 0) + G
returns_count[s] = returns_count.get(s, 0) + 1
return {s: returns_sum[s] / returns_count[s] for s in returns_sum}MC for Q^π and control
Store returns per (s, a) pair when action a was taken at s. MC control improves policy toward greedy Q, but need exploring starts (every s, a possible at episode start) or ε-soft policies so every pair is visited infinitely often.
Checkpoint: Why can MC not update mid-episode in continuing tasks without modifications?
Answer
Return Gₜ is defined to episode end — no natural termination in continuing problems. Fixes: use discounted returns over long windows, temporal-difference methods, or average-reward formulation.
On-policy MC control sketch
- Generate episode with ε-soft π.
- For each (s, a) in episode, Gₜ ← return from first visit.
- Q(s, a) ← average of returns.
- π(s) ← ε-greedy with respect to Q.
Variance and sample efficiency
| Property | MC |
|---|---|
| Bias | None (for V^π of generating π) |
| Variance | High — full trajectory noise |
| Update timing | End of episode only |
| Bootstrapping | No |
Long episodes dilute credit — reward at step 500 affects G₀ strongly even if early actions mattered more.
Common mistakes
- Averaging returns without enough visits — V(s) stays noisy.
- Evaluating π while data comes from a different behavior policy without importance sampling (off-policy MC is advanced).
- Forgetting to process episodes backward when computing Gₜ efficiently.
- Using MC on non-episodic tasks without defining episode boundaries.
Monte Carlo uses full returns; temporal-difference learning updates after every step by bootstrapping — lower variance, introducing slight bias, and the foundation of Q-learning.