← Back to curriculum

Module 7 — Model-based RL

Monte Carlo tree search

Selection, expansion, simulation, backprop; UCT; AlphaGo connection.

~70 min read + exercises

Monte Carlo tree search

Before we begin

Monte Carlo Tree Search (MCTS) powers superhuman play in Go, chess, and many discrete-action RL benchmarks. Unlike Dyna-Q's flat replay, MCTS builds a search tree from the current state, balances exploration vs exploitation along branches, and returns an action backed by thousands of simulated rollouts. Combined with learned value networks (AlphaGo) or learned models (MuZero), MCTS remains the gold standard for planning in large discrete spaces.

MCTS — iteratively expand a tree, simulate to the end or to a depth limit, backpropagate returns.
UCT — selection rule balancing visit counts and value estimates.
Rollout policy — default policy after leaving the tree (random or heuristic).


What you will learn

  • Walk through the four MCTS phases: select, expand, simulate, backpropagate.
  • Apply UCT (Upper Confidence bounds applied to Trees) for action selection.
  • Connect MCTS to model-free policy when rollouts use the real env or a model.
  • Contrast MCTS with brute-force planning and Dyna-Q.
  • See where MCTS fits in games vs robotics (discrete vs continuous).

The four phases

Each MCTS iteration (repeat N times, e.g. 800):

  1. Selection — start at root; follow UCT child until reaching a node not fully expanded.
  2. Expansion — add one child for an untried action.
  3. Simulation (rollout) — play out random or policy actions until terminal or depth limit; record return G.
  4. Backpropagation — add G to visit count and value sum for every node on the path.

After N iterations, pick the action at the root with highest visit count (or highest mean value).

UCT formula (intuition)

For child i with mean value Q̄ᵢ and visit count nᵢ, parent visit count N:

UCTᵢ = Q̄ᵢ + c · √(ln N / nᵢ)

The second term grows for rarely visited children → exploration. c ≈ √2 is common. This is the same explore/exploit tension as UCB bandits, applied along a tree.


Worked example: tic-tac-toe node

Root: empty board, 9 possible moves. After 200 iterations:

ActionVisitsMean valueUCT (c=1.4)
Center800.550.55 + 0.19 = 0.74
Corner450.600.60 + 0.26 = 0.86
Edge300.400.40 + 0.31 = 0.71

Corner wins on visits × value trade-off. Checkpoint: Why pick max visits at root instead of max mean value?

Answer

Visit counts aggregate robustness — an action with high mean from 2 lucky rollouts is less trustworthy than one with slightly lower mean but 80 visits. Max-visits is more stable at the root, especially when simulation noise is high.


Pseudocode sketch

python
class Node:
  def __init__(self, state, parent=None):
    self.state = state
    self.parent = parent
    self.children = {}  # action -> Node
    self.N = 0
    self.W = 0.0  # total value
 
  def uct(self, action, c=1.4):
    child = self.children[action]
    if child.N == 0:
      return float("inf")
    q = child.W / child.N
    return q + c * math.sqrt(math.log(self.N) / child.N)
 
def mcts(root_state, n_iter=500):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # SELECT + EXPAND
        while node.children and node.is_fully_expanded():
            a = max(node.children.keys(), key=lambda ac: node.uct(ac))
            node = node.children[a]
        if not node.is_terminal():
            a = node.untried_action()
            s_next = env.step_from(node.state, a)
            child = Node(s_next, parent=node)
            node.children[a] = child
            node = child
        # SIMULATE
        G = rollout(node.state)
        # BACKPROP
        while node:
            node.N += 1
            node.W += G
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].N)

MCTS + learned components

SystemWhat is learnedWhat MCTS does
AlphaGoPolicy prior P(as), value V(s)
MuZeroDynamics + value + policy in latent spaceMCTS in learned model
Plain MCTSNothingRandom rollouts to terminal

Learned policy prior sharpens expansion toward plausible moves. Learned value replaces long random rollouts — evaluate leaf with V(s) instead of simulating 200 steps.


MCTS vs other planners

MethodAction spaceNeeds model?Typical use
Dyna-QSmall discreteYes (one-step)Tabular, fast updates
MCTSDiscrete, moderate branchingRollout or modelBoard games, planning
MPC / CEMContinuousOne-step modelRobotics
Value iterationSmall SFull P(s′s,a)

MCTS struggles with continuous or huge branching (thousands of actions) without discretization or learned priors.


Common mistakes

MistakeSymptomFix
Too few iterationsNoisy action choiceScale N with branching factor
Random rollouts in deep gamesWeak leaf valuesLearned value network
Reusing tree after unrelated state jumpInvalid statisticsReset tree each step or use move history
Maximizing mean at root with few simsOverfits luckPrefer visit count
Ignoring opponent moves in two-playerWrong game treeAlternate plies / expectimax

Closing

MCTS turns many weak rollouts into a strong action by structuring search in a tree and allocating simulations where they matter. It is the planning backbone for discrete decision-making at scale; combine it with learned models and values when raw simulation is too expensive — the pattern behind MuZero and modern game-playing agents.


Before this lesson


What's next