Dynamic programming — policy & value iteration
Before we begin
Dynamic programming (DP) solves MDPs when you know the model P and R — no learning from samples required. Policy evaluation computes V^π; policy improvement makes π better; policy iteration alternates until optimal. Value iteration skips explicit policy evaluation and backs up V* directly.
Real RL rarely has a full model, but DP is the cleanest place to see Bellman backups work exactly.
What you will learn
- Implement policy evaluation for a tabular MDP.
- Perform policy improvement via greedy action selection.
- Run policy iteration and value iteration on a grid.
- Compare iteration counts and when each is preferred.
- Connect DP to later sample-based methods (TD, Q-learning).
Policy evaluation
Given fixed π, iterate until V converges:
V(s) ← Σ_a π(a|s) Σ_s′ P(s′|s,a) [ R + γ V(s′) ]
import numpy as np
def policy_eval(P, pi, gamma=0.9, theta=1e-8):
V = np.zeros(len(P))
while True:
delta = 0
for s in range(len(P)):
v = 0
for a, prob_a in enumerate(pi[s]):
if prob_a == 0:
continue
for p, s_next, r in P[s][a]:
v += prob_a * p * (r + gamma * V[s_next])
delta = max(delta, abs(v - V[s]))
V[s] = v
if delta < theta:
break
return VPolicy improvement
After evaluation, be greedy:
π(s) ← argmax_a Σ_s′ P(s′|s,a) [ R + γ V(s′) ]
If new π is not strictly better at every state, π is optimal (policy improvement theorem).
Tiny example
| State | V after eval | Best action by one-step lookahead |
|---|---|---|
| 0 | 2.0 | right (lookahead 2.7) |
| 1 | 5.0 | stay (terminal) |
Policy iteration vs value iteration
| Method | Inner loop | Outer loop |
|---|---|---|
| Policy iteration | Full policy eval | Greedy improvement |
| Value iteration | One Bellman optimality sweep | Repeat until ‖V−V_new‖ small |
Value iteration backup:
V(s) ← max_a Σ_s′ P(s′|s,a) [ R + γ V(s′) ]
def value_iteration(P, gamma=0.9, theta=1e-8):
V = np.zeros(len(P))
policy = np.zeros(len(P), dtype=int)
while True:
delta = 0
for s in range(len(P)):
q_sa = []
for a in range(len(P[s])):
q = sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
q_sa.append(q)
best = max(q_sa)
delta = max(delta, abs(best - V[s]))
V[s] = best
policy[s] = int(np.argmax(q_sa))
if delta < theta:
break
return V, policyCheckpoint: Value iteration stops after one sweep per “iteration.” Policy iteration runs many sweeps per policy. Which often has fewer outer loops on small grids?
Answer
Policy iteration — full evaluation gives exact V^π each round, so greedy policy jumps can be fewer. Value iteration does cheap partial updates but more outer iterations. Total work varies by problem.
Gridworld intuition
4×4 grid, start top-left, goal bottom-right, reward −1 per step, γ = 0.9. Optimal policy points toward goal along shortest paths. Walls block P(s′|s,a).
Plot V as heatmap: values increase toward goal; corners far from goal are more negative.
Model-based vs model-free
| DP | TD / Q-learning |
|---|---|
| Needs P, R | Learns from samples only |
| Exact backups | Stochastic updates |
| Planning | Interaction |
Dyna-Q (Module 7) blends both — learn a model, plan with DP-style updates.
Common mistakes
- Running policy improvement before V^π has converged — can still work with bounded error, but sloppy eval hurts.
- Using max during evaluation of a stochastic policy without averaging over π’s actions.
- Off-by-one in terminal states — backups should not propagate value through true terminals incorrectly.
- Assuming DP scales — state space explosion hits real problems (motivation for function approximation).
DP gives exact solutions for tabular MDPs with known dynamics. When the model is unknown, Monte Carlo methods estimate values from complete episode returns — our next topic.