← Back to curriculum

Module 1 — RL foundations & MDPs

Bellman equations & value functions

V^π, Q^π, Bellman expectation and optimality, and greedy policies from Q*.

~65 min read + exercises

Bellman equations & value functions

Before we begin

A value function answers: “How good is it to be here, if I follow policy π?” Bellman equations decompose that question into immediate reward plus discounted value of what comes next. They are the backbone of dynamic programming, temporal-difference learning, and Q-learning.

Once you can read V^π and Q^π, papers and pseudocode become much easier to follow.


What you will learn

  • Define state-value V^π(s) and action-value Q^π(s, a).
  • Write Bellman expectation equations for V and Q.
  • Write Bellman optimality equations for V* and Q*.
  • Derive greedy policy from Q*: π*(s) = argmax_a Q*(s, a).
  • Perform one manual Bellman backup on a tiny MDP.

State-value function V^π(s)

Definition: expected return starting in state s, following policy π thereafter.

V^π(s) = E_π[ Gₜ | Sₜ = s ]

Bellman expectation for V^π:

V^π(s) = Σ_a π(a|s) Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ V^π(s′) ]

In words: average over actions the policy takes, then over next states the environment might emit, immediate reward plus discounted value of s′.


Action-value function Q^π(s, a)

Definition: expected return starting in s, taking action a, then following π.

Q^π(s, a) = E_π[ Gₜ | Sₜ = s, Aₜ = a ]

Bellman expectation for Q^π:

Q^π(s, a) = Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ Σ_a′ π(a′|s′) Q^π(s′, a′) ]

Function“What if I…”
V^π(s)…start here and follow π
Q^π(s, a)…start here, force action a, then follow π

Relation: V^π(s) = Σ_a π(a|s) Q^π(s, a)


Worked backup: one-state line

Two states L and R. In L, action go deterministically moves to R with reward 0. In R, any action gives reward +5 and stays in R (terminal). γ = 0.9, π always chooses go.

V^π(R) = 5 + γ·0 = 5 (absorbing terminal treated as no future)

V^π(L) = 0 + γ V^π(R) = 0.9 × 5 = 4.5

python
gamma = 0.9
V_R = 5.0
V_L = 0 + gamma * V_R
print("V(L) =", V_L)  # 4.5

Checkpoint: If γ = 0, what is V^π(L)?

Answer

V^π(L) = 0 + 0 × V^π(R) = 0 — only immediate reward counts.


Optimal value functions and Bellman optimality

V(s)* — best expected return achievable from s under any policy.

Bellman optimality for V:*

V*(s) = max_a Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ V*(s′) ]

Q form:*

Q*(s, a) = Σ_s′ P(s′|s,a) [ R(s,a,s′) + γ max_a′ Q*(s′, a′) ]

Greedy policy: π*(s) = argmax_a Q*(s, a) — pick the action with highest optimal Q.

ObjectBellman uses
V^π, Q^πExpectation over π
V*, Q*max over actions (optimal)

Why Bellman matters for algorithms

Algorithm familyUses Bellman as…
Policy iterationExact solve on tabular MDP
Value iterationRepeated V* backup
Q-learningSampled Q* backup with max
SARSASampled Q^π backup with actual next action

TD error δ = r + γ V(s′) − V(s) is a sample of the Bellman residual.


Common mistakes

  • Using max in backups when evaluating a fixed policy (that is optimality, not expectation).
  • Forgetting that V(terminal) is often 0 by convention when no further reward.
  • Confusing Q(s, a) for one action with V(s) which averages over π.
  • Assuming Bellman backups always converge — function approximation can diverge (Module 3).

Bellman equations turn “long-term goodness” into local, recursive updates. Module 2 uses them in dynamic programming and temporal-difference learning to learn V, Q, and optimal policies from data.


Before this lesson


What's next

Continue from the module welcome or the curriculum sidebar.