Dynamic programming — policy & value iteration

Before we begin

Dynamic programming (DP) solves MDPs when you know the model P and R — no learning from samples required. Policy evaluation computes V^π; policy improvement makes π better; policy iteration alternates until optimal. Value iteration skips explicit policy evaluation and backs up V* directly.

Real RL rarely has a full model, but DP is the cleanest place to see Bellman backups work exactly.

What you will learn

Implement policy evaluation for a tabular MDP.
Perform policy improvement via greedy action selection.
Run policy iteration and value iteration on a grid.
Compare iteration counts and when each is preferred.
Connect DP to later sample-based methods (TD, Q-learning).

Policy evaluation

Given fixed π, iterate until V converges:

V(s) ← Σ_a π(a|s) Σ_s′ P(s′|s,a) [ R + γ V(s′) ]

python

import numpy as np
 
def policy_eval(P, pi, gamma=0.9, theta=1e-8):
    V = np.zeros(len(P))
    while True:
        delta = 0
        for s in range(len(P)):
            v = 0
            for a, prob_a in enumerate(pi[s]):
                if prob_a == 0:
                    continue
                for p, s_next, r in P[s][a]:
                    v += prob_a * p * (r + gamma * V[s_next])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V

Policy improvement

After evaluation, be greedy:

π(s) ← argmax_a Σ_s′ P(s′|s,a) [ R + γ V(s′) ]

If new π is not strictly better at every state, π is optimal (policy improvement theorem).

Tiny example

State	V after eval	Best action by one-step lookahead
0	2.0	right (lookahead 2.7)
1	5.0	stay (terminal)

Policy iteration vs value iteration

Method	Inner loop	Outer loop
Policy iteration	Full policy eval	Greedy improvement
Value iteration	One Bellman optimality sweep	Repeat until ‖V−V_new‖ small

Value iteration backup:

V(s) ← max_a Σ_s′ P(s′|s,a) [ R + γ V(s′) ]

python

def value_iteration(P, gamma=0.9, theta=1e-8):
    V = np.zeros(len(P))
    policy = np.zeros(len(P), dtype=int)
    while True:
        delta = 0
        for s in range(len(P)):
            q_sa = []
            for a in range(len(P[s])):
                q = sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                q_sa.append(q)
            best = max(q_sa)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
            policy[s] = int(np.argmax(q_sa))
        if delta < theta:
            break
    return V, policy

Checkpoint: Value iteration stops after one sweep per “iteration.” Policy iteration runs many sweeps per policy. Which often has fewer outer loops on small grids?

Answer

Policy iteration — full evaluation gives exact V^π each round, so greedy policy jumps can be fewer. Value iteration does cheap partial updates but more outer iterations. Total work varies by problem.

Gridworld intuition

4×4 grid, start top-left, goal bottom-right, reward −1 per step, γ = 0.9. Optimal policy points toward goal along shortest paths. Walls block P(s′|s,a).

Plot V as heatmap: values increase toward goal; corners far from goal are more negative.

Model-based vs model-free

DP	TD / Q-learning
Needs P, R	Learns from samples only
Exact backups	Stochastic updates
Planning	Interaction

Dyna-Q (Module 7) blends both — learn a model, plan with DP-style updates.

Common mistakes

Running policy improvement before V^π has converged — can still work with bounded error, but sloppy eval hurts.
Using max during evaluation of a stochastic policy without averaging over π’s actions.
Off-by-one in terminal states — backups should not propagate value through true terminals incorrectly.
Assuming DP scales — state space explosion hits real problems (motivation for function approximation).

DP gives exact solutions for tabular MDPs with known dynamics. When the model is unknown, Monte Carlo methods estimate values from complete episode returns — our next topic.

Before this lesson

Previous lesson

What's next

Next lesson — Monte Carlo methods