Markov decision processes
Before we begin
An MDP (Markov decision process) is the standard mathematical model for reinforcement learning. It packages states, actions, transition dynamics, rewards, and a discount factor into one object you can simulate, analyze, and solve — at least when the state space is small enough.
The Markov property is the key assumption: the future depends on the present state and action, not on the full history. If your state vector is missing something important, you do not have an MDP — you have a harder partially observable problem.
What you will learn
- Write the MDP tuple (S, A, P, R, γ) and explain each part.
- State the Markov property in words and notation.
- Read transition probabilities P(s′ | s, a).
- Connect MDPs to Gymnasium
observation_spaceandaction_space. - Build a tiny tabular MDP by hand.
The MDP tuple
| Symbol | Name | Meaning |
|---|---|---|
| S | State space | All possible states |
| A | Action space | Actions the agent can take |
| P | Transition kernel | P(s′ |
| R | Reward function | Expected immediate reward R(s, a) or R(s, a, s′) |
| γ | Discount factor | γ ∈ [0, 1] weights future reward |
A policy π(a | s) assigns action probabilities per state. The agent’s job: find a policy that maximizes expected cumulative reward.
The Markov property
In words: given the current state sₜ and action aₜ, the distribution over the next state sₜ₊₁ does not depend on earlier states or actions.
Why it matters: algorithms store values per state (or per state–action pair). If the state is not “sufficient,” two histories that look like the same observation can require different optimal actions — tabular methods will fail.
Fixing non-Markov observations
| Problem | Symptom | Fix |
|---|---|---|
| CartPole with position only | Same x, different velocity | Add velocity to state |
| Poker without card history | Same public cards, different hands | Include private cards or belief state |
| Stock price only | Momentum hidden | Add moving averages as features |
Checkpoint: You observe pixel frames from Atari. Is the single frame Markov?
Answer
Usually no — one frame lacks velocity. Standard fix: stack the last 4 frames so the state encodes motion. Deep Q-Networks do exactly this.
Worked example: 3-state line world
States: A (start), B (middle), C (goal). Actions: left, right. Deterministic transitions except from B: right goes to C with 0.8, slips back to A with 0.2.
| s | a | s′ | P(s′ | s, a) | R | |---|---|-----|---------------|---| | A | right | B | 1.0 | 0 | | B | right | C | 0.8 | 0 | | B | right | A | 0.2 | 0 | | C | any | C | 1.0 | +10 (terminal) |
Rewards: 0 per step, +10 entering C. This tiny MDP is enough to run value iteration later.
# Tabular transition dict: P[s][a] -> list of (prob, next_s, reward)
P = {
"A": {"right": [(1.0, "B", 0.0)]},
"B": {"right": [(0.8, "C", 0.0), (0.2, "A", 0.0)]},
"C": {"left": [(1.0, "C", 10.0)], "right": [(1.0, "C", 10.0)]},
}Stochastic vs deterministic environments
| Type | P(s′ | s, a) | Example | |------|---------------|---------| | Deterministic | One s′ with prob 1 | Gridworld without slip | | Stochastic | Multiple outcomes | Slippery ice, noisy actuators |
Expected next state is not enough for optimal control — you must plan over the distribution. Value functions average over that distribution.
Gymnasium spaces
import gymnasium as gym
env = gym.make("FrozenLake-v1")
print(env.observation_space) # Discrete(16) for 4x4 grid
print(env.action_space) # Discrete(4): left, down, right, up| Space type | MDP role |
|---|---|
Discrete(n) | Finite states or actions |
Box(low, high, shape) | Continuous vectors |
Dict(...) | Multi-part observations |
The environment implements P and R; your agent never hard-codes them in real problems — it learns from samples.
Common mistakes
- Using observation and state interchangeably in POMDP settings.
- Defining rewards that omit step penalties when you care about short paths — agent may wander forever.
- Assuming actions are always discrete — robotics and control use continuous action spaces (Module 8).
- Forgetting terminal states: transitions from terminal s should stop the episode.
MDPs turn “learning from interaction” into precise math. Next we define returns and discounting — the scalar objective every RL algorithm optimizes.