Monte Carlo tree search
Before we begin
Monte Carlo Tree Search (MCTS) powers superhuman play in Go, chess, and many discrete-action RL benchmarks. Unlike Dyna-Q's flat replay, MCTS builds a search tree from the current state, balances exploration vs exploitation along branches, and returns an action backed by thousands of simulated rollouts. Combined with learned value networks (AlphaGo) or learned models (MuZero), MCTS remains the gold standard for planning in large discrete spaces.
MCTS — iteratively expand a tree, simulate to the end or to a depth limit, backpropagate returns.
UCT — selection rule balancing visit counts and value estimates.
Rollout policy — default policy after leaving the tree (random or heuristic).
What you will learn
- Walk through the four MCTS phases: select, expand, simulate, backpropagate.
- Apply UCT (Upper Confidence bounds applied to Trees) for action selection.
- Connect MCTS to model-free policy when rollouts use the real env or a model.
- Contrast MCTS with brute-force planning and Dyna-Q.
- See where MCTS fits in games vs robotics (discrete vs continuous).
The four phases
Each MCTS iteration (repeat N times, e.g. 800):
- Selection — start at root; follow UCT child until reaching a node not fully expanded.
- Expansion — add one child for an untried action.
- Simulation (rollout) — play out random or policy actions until terminal or depth limit; record return G.
- Backpropagation — add G to visit count and value sum for every node on the path.
After N iterations, pick the action at the root with highest visit count (or highest mean value).
UCT formula (intuition)
For child i with mean value Q̄ᵢ and visit count nᵢ, parent visit count N:
UCTᵢ = Q̄ᵢ + c · √(ln N / nᵢ)
The second term grows for rarely visited children → exploration. c ≈ √2 is common. This is the same explore/exploit tension as UCB bandits, applied along a tree.
Worked example: tic-tac-toe node
Root: empty board, 9 possible moves. After 200 iterations:
| Action | Visits | Mean value | UCT (c=1.4) |
|---|---|---|---|
| Center | 80 | 0.55 | 0.55 + 0.19 = 0.74 |
| Corner | 45 | 0.60 | 0.60 + 0.26 = 0.86 |
| Edge | 30 | 0.40 | 0.40 + 0.31 = 0.71 |
Corner wins on visits × value trade-off. Checkpoint: Why pick max visits at root instead of max mean value?
Answer
Visit counts aggregate robustness — an action with high mean from 2 lucky rollouts is less trustworthy than one with slightly lower mean but 80 visits. Max-visits is more stable at the root, especially when simulation noise is high.
Pseudocode sketch
class Node:
def __init__(self, state, parent=None):
self.state = state
self.parent = parent
self.children = {} # action -> Node
self.N = 0
self.W = 0.0 # total value
def uct(self, action, c=1.4):
child = self.children[action]
if child.N == 0:
return float("inf")
q = child.W / child.N
return q + c * math.sqrt(math.log(self.N) / child.N)
def mcts(root_state, n_iter=500):
root = Node(root_state)
for _ in range(n_iter):
node = root
# SELECT + EXPAND
while node.children and node.is_fully_expanded():
a = max(node.children.keys(), key=lambda ac: node.uct(ac))
node = node.children[a]
if not node.is_terminal():
a = node.untried_action()
s_next = env.step_from(node.state, a)
child = Node(s_next, parent=node)
node.children[a] = child
node = child
# SIMULATE
G = rollout(node.state)
# BACKPROP
while node:
node.N += 1
node.W += G
node = node.parent
return max(root.children, key=lambda a: root.children[a].N)MCTS + learned components
| System | What is learned | What MCTS does |
|---|---|---|
| AlphaGo | Policy prior P(a | s), value V(s) |
| MuZero | Dynamics + value + policy in latent space | MCTS in learned model |
| Plain MCTS | Nothing | Random rollouts to terminal |
Learned policy prior sharpens expansion toward plausible moves. Learned value replaces long random rollouts — evaluate leaf with V(s) instead of simulating 200 steps.
MCTS vs other planners
| Method | Action space | Needs model? | Typical use |
|---|---|---|---|
| Dyna-Q | Small discrete | Yes (one-step) | Tabular, fast updates |
| MCTS | Discrete, moderate branching | Rollout or model | Board games, planning |
| MPC / CEM | Continuous | One-step model | Robotics |
| Value iteration | Small S | Full P(s′ | s,a) |
MCTS struggles with continuous or huge branching (thousands of actions) without discretization or learned priors.
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Too few iterations | Noisy action choice | Scale N with branching factor |
| Random rollouts in deep games | Weak leaf values | Learned value network |
| Reusing tree after unrelated state jump | Invalid statistics | Reset tree each step or use move history |
| Maximizing mean at root with few sims | Overfits luck | Prefer visit count |
| Ignoring opponent moves in two-player | Wrong game tree | Alternate plies / expectimax |
Closing
MCTS turns many weak rollouts into a strong action by structuring search in a tree and allocating simulations where they matter. It is the planning backbone for discrete decision-making at scale; combine it with learned models and values when raw simulation is too expensive — the pattern behind MuZero and modern game-playing agents.