Monte Carlo tree search

Before we begin

Monte Carlo Tree Search (MCTS) powers superhuman play in Go, chess, and many discrete-action RL benchmarks. Unlike Dyna-Q's flat replay, MCTS builds a search tree from the current state, balances exploration vs exploitation along branches, and returns an action backed by thousands of simulated rollouts. Combined with learned value networks (AlphaGo) or learned models (MuZero), MCTS remains the gold standard for planning in large discrete spaces.

MCTS — iteratively expand a tree, simulate to the end or to a depth limit, backpropagate returns.
UCT — selection rule balancing visit counts and value estimates.
Rollout policy — default policy after leaving the tree (random or heuristic).

What you will learn

Walk through the four MCTS phases: select, expand, simulate, backpropagate.
Apply UCT (Upper Confidence bounds applied to Trees) for action selection.
Connect MCTS to model-free policy when rollouts use the real env or a model.
Contrast MCTS with brute-force planning and Dyna-Q.
See where MCTS fits in games vs robotics (discrete vs continuous).

The four phases

Each MCTS iteration (repeat N times, e.g. 800):

Selection — start at root; follow UCT child until reaching a node not fully expanded.
Expansion — add one child for an untried action.
Simulation (rollout) — play out random or policy actions until terminal or depth limit; record return G.
Backpropagation — add G to visit count and value sum for every node on the path.

After N iterations, pick the action at the root with highest visit count (or highest mean value).

UCT formula (intuition)

For child i with mean value Q̄ᵢ and visit count nᵢ, parent visit count N:

UCTᵢ = Q̄ᵢ + c · √(ln N / nᵢ)

The second term grows for rarely visited children → exploration. c ≈ √2 is common. This is the same explore/exploit tension as UCB bandits, applied along a tree.

Worked example: tic-tac-toe node

Root: empty board, 9 possible moves. After 200 iterations:

Action	Visits	Mean value	UCT (c=1.4)
Center	80	0.55	0.55 + 0.19 = 0.74
Corner	45	0.60	0.60 + 0.26 = 0.86
Edge	30	0.40	0.40 + 0.31 = 0.71

Corner wins on visits × value trade-off. Checkpoint: Why pick max visits at root instead of max mean value?

Answer

Visit counts aggregate robustness — an action with high mean from 2 lucky rollouts is less trustworthy than one with slightly lower mean but 80 visits. Max-visits is more stable at the root, especially when simulation noise is high.

Pseudocode sketch

python

class Node:
  def __init__(self, state, parent=None):
    self.state = state
    self.parent = parent
    self.children = {}  # action -> Node
    self.N = 0
    self.W = 0.0  # total value
 
  def uct(self, action, c=1.4):
    child = self.children[action]
    if child.N == 0:
      return float("inf")
    q = child.W / child.N
    return q + c * math.sqrt(math.log(self.N) / child.N)
 
def mcts(root_state, n_iter=500):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # SELECT + EXPAND
        while node.children and node.is_fully_expanded():
            a = max(node.children.keys(), key=lambda ac: node.uct(ac))
            node = node.children[a]
        if not node.is_terminal():
            a = node.untried_action()
            s_next = env.step_from(node.state, a)
            child = Node(s_next, parent=node)
            node.children[a] = child
            node = child
        # SIMULATE
        G = rollout(node.state)
        # BACKPROP
        while node:
            node.N += 1
            node.W += G
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].N)

MCTS + learned components

System	What is learned	What MCTS does
AlphaGo	Policy prior P(a	s), value V(s)
MuZero	Dynamics + value + policy in latent space	MCTS in learned model
Plain MCTS	Nothing	Random rollouts to terminal

Learned policy prior sharpens expansion toward plausible moves. Learned value replaces long random rollouts — evaluate leaf with V(s) instead of simulating 200 steps.

MCTS vs other planners

Method	Action space	Needs model?	Typical use
Dyna-Q	Small discrete	Yes (one-step)	Tabular, fast updates
MCTS	Discrete, moderate branching	Rollout or model	Board games, planning
MPC / CEM	Continuous	One-step model	Robotics
Value iteration	Small S	Full P(s′	s,a)

MCTS struggles with continuous or huge branching (thousands of actions) without discretization or learned priors.

Common mistakes

Mistake	Symptom	Fix
Too few iterations	Noisy action choice	Scale N with branching factor
Random rollouts in deep games	Weak leaf values	Learned value network
Reusing tree after unrelated state jump	Invalid statistics	Reset tree each step or use move history
Maximizing mean at root with few sims	Overfits luck	Prefer visit count
Ignoring opponent moves in two-player	Wrong game tree	Alternate plies / expectimax

Closing

MCTS turns many weak rollouts into a strong action by structuring search in a tree and allocating simulations where they matter. It is the planning backbone for discrete decision-making at scale; combine it with learned models and values when raw simulation is too expensive — the pattern behind MuZero and modern game-playing agents.

Before this lesson

Previous lesson

What's next

Next lesson — World models & Dreamer (intro)