← Back to curriculum

Module 2 — Tabular methods

Dynamic programming — policy & value iteration

Policy evaluation, policy iteration, value iteration, and when DP is tractable.

~70 min read + exercises

Dynamic programming — policy & value iteration

Before we begin

Dynamic programming (DP) solves MDPs when you know the model P and R — no learning from samples required. Policy evaluation computes V^π; policy improvement makes π better; policy iteration alternates until optimal. Value iteration skips explicit policy evaluation and backs up V* directly.

Real RL rarely has a full model, but DP is the cleanest place to see Bellman backups work exactly.


What you will learn

  • Implement policy evaluation for a tabular MDP.
  • Perform policy improvement via greedy action selection.
  • Run policy iteration and value iteration on a grid.
  • Compare iteration counts and when each is preferred.
  • Connect DP to later sample-based methods (TD, Q-learning).

Policy evaluation

Given fixed π, iterate until V converges:

V(s) ← Σ_a π(a|s) Σ_s′ P(s′|s,a) [ R + γ V(s′) ]

python
import numpy as np
 
def policy_eval(P, pi, gamma=0.9, theta=1e-8):
    V = np.zeros(len(P))
    while True:
        delta = 0
        for s in range(len(P)):
            v = 0
            for a, prob_a in enumerate(pi[s]):
                if prob_a == 0:
                    continue
                for p, s_next, r in P[s][a]:
                    v += prob_a * p * (r + gamma * V[s_next])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V

Policy improvement

After evaluation, be greedy:

π(s) ← argmax_a Σ_s′ P(s′|s,a) [ R + γ V(s′) ]

If new π is not strictly better at every state, π is optimal (policy improvement theorem).

Tiny example

StateV after evalBest action by one-step lookahead
02.0right (lookahead 2.7)
15.0stay (terminal)

Policy iteration vs value iteration

MethodInner loopOuter loop
Policy iterationFull policy evalGreedy improvement
Value iterationOne Bellman optimality sweepRepeat until ‖V−V_new‖ small

Value iteration backup:

V(s) ← max_a Σ_s′ P(s′|s,a) [ R + γ V(s′) ]

python
def value_iteration(P, gamma=0.9, theta=1e-8):
    V = np.zeros(len(P))
    policy = np.zeros(len(P), dtype=int)
    while True:
        delta = 0
        for s in range(len(P)):
            q_sa = []
            for a in range(len(P[s])):
                q = sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                q_sa.append(q)
            best = max(q_sa)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
            policy[s] = int(np.argmax(q_sa))
        if delta < theta:
            break
    return V, policy

Checkpoint: Value iteration stops after one sweep per “iteration.” Policy iteration runs many sweeps per policy. Which often has fewer outer loops on small grids?

Answer

Policy iteration — full evaluation gives exact V^π each round, so greedy policy jumps can be fewer. Value iteration does cheap partial updates but more outer iterations. Total work varies by problem.


Gridworld intuition

4×4 grid, start top-left, goal bottom-right, reward −1 per step, γ = 0.9. Optimal policy points toward goal along shortest paths. Walls block P(s′|s,a).

Plot V as heatmap: values increase toward goal; corners far from goal are more negative.


Model-based vs model-free

DPTD / Q-learning
Needs P, RLearns from samples only
Exact backupsStochastic updates
PlanningInteraction

Dyna-Q (Module 7) blends both — learn a model, plan with DP-style updates.


Common mistakes

  • Running policy improvement before V^π has converged — can still work with bounded error, but sloppy eval hurts.
  • Using max during evaluation of a stochastic policy without averaging over π’s actions.
  • Off-by-one in terminal states — backups should not propagate value through true terminals incorrectly.
  • Assuming DP scales — state space explosion hits real problems (motivation for function approximation).

DP gives exact solutions for tabular MDPs with known dynamics. When the model is unknown, Monte Carlo methods estimate values from complete episode returns — our next topic.


Before this lesson


What's next