Bellman equations & value functions
Before we begin
A value function answers: “How good is it to be here, if I follow policy π?” Bellman equations decompose that question into immediate reward plus discounted value of what comes next. They are the backbone of dynamic programming, temporal-difference learning, and Q-learning.
Once you can read V^π and Q^π, papers and pseudocode become much easier to follow.
What you will learn
- Define state-value V^π(s) and action-value Q^π(s, a).
- Write Bellman expectation equations for V and Q.
- Write Bellman optimality equations for V* and Q*.
- Derive greedy policy from Q*: π*(s) = argmax_a Q*(s, a).
- Perform one manual Bellman backup on a tiny MDP.
State-value function V^π(s)
Definition: expected return starting in state s, following policy π thereafter.
V^π(s) = E_π[ Gₜ | Sₜ = s ]
Bellman expectation for V^π:
V^π(s) = Σ_a π(a|s) Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ V^π(s′) ]
In words: average over actions the policy takes, then over next states the environment might emit, immediate reward plus discounted value of s′.
Action-value function Q^π(s, a)
Definition: expected return starting in s, taking action a, then following π.
Q^π(s, a) = E_π[ Gₜ | Sₜ = s, Aₜ = a ]
Bellman expectation for Q^π:
Q^π(s, a) = Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ Σ_a′ π(a′|s′) Q^π(s′, a′) ]
| Function | “What if I…” |
|---|---|
| V^π(s) | …start here and follow π |
| Q^π(s, a) | …start here, force action a, then follow π |
Relation: V^π(s) = Σ_a π(a|s) Q^π(s, a)
Worked backup: one-state line
Two states L and R. In L, action go deterministically moves to R with reward 0. In R, any action gives reward +5 and stays in R (terminal). γ = 0.9, π always chooses go.
V^π(R) = 5 + γ·0 = 5 (absorbing terminal treated as no future)
V^π(L) = 0 + γ V^π(R) = 0.9 × 5 = 4.5
gamma = 0.9
V_R = 5.0
V_L = 0 + gamma * V_R
print("V(L) =", V_L) # 4.5Checkpoint: If γ = 0, what is V^π(L)?
Answer
V^π(L) = 0 + 0 × V^π(R) = 0 — only immediate reward counts.
Optimal value functions and Bellman optimality
V(s)* — best expected return achievable from s under any policy.
Bellman optimality for V:*
V*(s) = max_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ V*(s′) ]
Q form:*
Q*(s, a) = Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ max_a′ Q*(s′, a′) ]
Greedy policy: π*(s) = argmax_a Q*(s, a) — pick the action with highest optimal Q.
| Object | Bellman uses |
|---|---|
| V^π, Q^π | Expectation over π |
| V*, Q* | max over actions (optimal) |
Why Bellman matters for algorithms
| Algorithm family | Uses Bellman as… |
|---|---|
| Policy iteration | Exact solve on tabular MDP |
| Value iteration | Repeated V* backup |
| Q-learning | Sampled Q* backup with max |
| SARSA | Sampled Q^π backup with actual next action |
TD error δ = r + γ V(s′) − V(s) is a sample of the Bellman residual.
Common mistakes
- Using max in backups when evaluating a fixed policy (that is optimality, not expectation).
- Forgetting that V(terminal) is often 0 by convention when no further reward.
- Confusing Q(s, a) for one action with V(s) which averages over π.
- Assuming Bellman backups always converge — function approximation can diverge (Module 3).
Bellman equations turn “long-term goodness” into local, recursive updates. Module 2 uses them in dynamic programming and temporal-difference learning to learn V, Q, and optimal policies from data.
Before this lesson
What's next
Continue from the module welcome or the curriculum sidebar.