Bellman equations & value functions

Before we begin

A value function answers: “How good is it to be here, if I follow policy π?” Bellman equations decompose that question into immediate reward plus discounted value of what comes next. They are the backbone of dynamic programming, temporal-difference learning, and Q-learning.

Once you can read V^π and Q^π, papers and pseudocode become much easier to follow.

What you will learn

Define state-value V^π(s) and action-value Q^π(s, a).
Write Bellman expectation equations for V and Q.
Write Bellman optimality equations for V* and Q*.
Derive greedy policy from Q*: π*(s) = argmax_a Q*(s, a).
Perform one manual Bellman backup on a tiny MDP.

State-value function V^π(s)

Definition: expected return starting in state s, following policy π thereafter.

V^π(s) = E_π[ Gₜ | Sₜ = s ]

Bellman expectation for V^π:

V^π(s) = Σ_a π(a|s) Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ V^π(s′) ]

In words: average over actions the policy takes, then over next states the environment might emit, immediate reward plus discounted value of s′.

Action-value function Q^π(s, a)

Definition: expected return starting in s, taking action a, then following π.

Q^π(s, a) = E_π[ Gₜ | Sₜ = s, Aₜ = a ]

Bellman expectation for Q^π:

Q^π(s, a) = Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ Σ_a′ π(a′|s′) Q^π(s′, a′) ]

Function	“What if I…”
V^π(s)	…start here and follow π
Q^π(s, a)	…start here, force action a, then follow π

Relation: V^π(s) = Σ_a π(a|s) Q^π(s, a)

Worked backup: one-state line

Two states L and R. In L, action go deterministically moves to R with reward 0. In R, any action gives reward +5 and stays in R (terminal). γ = 0.9, π always chooses go.

V^π(R) = 5 + γ·0 = 5 (absorbing terminal treated as no future)

V^π(L) = 0 + γ V^π(R) = 0.9 × 5 = 4.5

python

gamma = 0.9
V_R = 5.0
V_L = 0 + gamma * V_R
print("V(L) =", V_L)  # 4.5

Checkpoint: If γ = 0, what is V^π(L)?

Answer

V^π(L) = 0 + 0 × V^π(R) = 0 — only immediate reward counts.

Optimal value functions and Bellman optimality

V(s)* — best expected return achievable from s under any policy.

Bellman optimality for V:*

V*(s) = max_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ V*(s′) ]

Q form:*

Q*(s, a) = Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ max_a′ Q*(s′, a′) ]

Greedy policy: π*(s) = argmax_a Q*(s, a) — pick the action with highest optimal Q.

Object	Bellman uses
V^π, Q^π	Expectation over π
V, Q	max over actions (optimal)

Why Bellman matters for algorithms

Algorithm family	Uses Bellman as…
Policy iteration	Exact solve on tabular MDP
Value iteration	Repeated V* backup
Q-learning	Sampled Q* backup with max
SARSA	Sampled Q^π backup with actual next action

TD error δ = r + γ V(s′) − V(s) is a sample of the Bellman residual.

Common mistakes

Using max in backups when evaluating a fixed policy (that is optimality, not expectation).
Forgetting that V(terminal) is often 0 by convention when no further reward.
Confusing Q(s, a) for one action with V(s) which averages over π.
Assuming Bellman backups always converge — function approximation can diverge (Module 3).

Bellman equations turn “long-term goodness” into local, recursive updates. Module 2 uses them in dynamic programming and temporal-difference learning to learn V, Q, and optimal policies from data.

Before this lesson

Previous lesson

What's next

Continue from the module welcome or the curriculum sidebar.