Markov decision processes

Before we begin

An MDP (Markov decision process) is the standard mathematical model for reinforcement learning. It packages states, actions, transition dynamics, rewards, and a discount factor into one object you can simulate, analyze, and solve — at least when the state space is small enough.

The Markov property is the key assumption: the future depends on the present state and action, not on the full history. If your state vector is missing something important, you do not have an MDP — you have a harder partially observable problem.

What you will learn

Write the MDP tuple (S, A, P, R, γ) and explain each part.
State the Markov property in words and notation.
Read transition probabilities P(s′ | s, a).
Connect MDPs to Gymnasium observation_space and action_space.
Build a tiny tabular MDP by hand.

The MDP tuple

Symbol	Name	Meaning
S	State space	All possible states
A	Action space	Actions the agent can take
P	Transition kernel	P(s′
R	Reward function	Expected immediate reward R(s, a) or R(s, a, s′)
γ	Discount factor	γ ∈ [0, 1] weights future reward

A policy π(a | s) assigns action probabilities per state. The agent’s job: find a policy that maximizes expected cumulative reward.

The Markov property

In words: given the current state sₜ and action aₜ, the distribution over the next state sₜ₊₁ does not depend on earlier states or actions.

Why it matters: algorithms store values per state (or per state–action pair). If the state is not “sufficient,” two histories that look like the same observation can require different optimal actions — tabular methods will fail.

Fixing non-Markov observations

Problem	Symptom	Fix
CartPole with position only	Same x, different velocity	Add velocity to state
Poker without card history	Same public cards, different hands	Include private cards or belief state
Stock price only	Momentum hidden	Add moving averages as features

Checkpoint: You observe pixel frames from Atari. Is the single frame Markov?

Answer

Usually no — one frame lacks velocity. Standard fix: stack the last 4 frames so the state encodes motion. Deep Q-Networks do exactly this.

Worked example: 3-state line world

States: A (start), B (middle), C (goal). Actions: left, right. Deterministic transitions except from B: right goes to C with 0.8, slips back to A with 0.2.

| s | a | s′ | P(s′ | s, a) | R | |---|---|-----|---------------|---| | A | right | B | 1.0 | 0 | | B | right | C | 0.8 | 0 | | B | right | A | 0.2 | 0 | | C | any | C | 1.0 | +10 (terminal) |

Rewards: 0 per step, +10 entering C. This tiny MDP is enough to run value iteration later.

python

# Tabular transition dict: P[s][a] -> list of (prob, next_s, reward)
P = {
    "A": {"right": [(1.0, "B", 0.0)]},
    "B": {"right": [(0.8, "C", 0.0), (0.2, "A", 0.0)]},
    "C": {"left": [(1.0, "C", 10.0)], "right": [(1.0, "C", 10.0)]},
}

Stochastic vs deterministic environments

| Type | P(s′ | s, a) | Example | |------|---------------|---------| | Deterministic | One s′ with prob 1 | Gridworld without slip | | Stochastic | Multiple outcomes | Slippery ice, noisy actuators |

Expected next state is not enough for optimal control — you must plan over the distribution. Value functions average over that distribution.

Gymnasium spaces

python

import gymnasium as gym
 
env = gym.make("FrozenLake-v1")
print(env.observation_space)  # Discrete(16) for 4x4 grid
print(env.action_space)       # Discrete(4): left, down, right, up

Space type	MDP role
`Discrete(n)`	Finite states or actions
`Box(low, high, shape)`	Continuous vectors
`Dict(...)`	Multi-part observations

The environment implements P and R; your agent never hard-codes them in real problems — it learns from samples.

Common mistakes

Using observation and state interchangeably in POMDP settings.
Defining rewards that omit step penalties when you care about short paths — agent may wander forever.
Assuming actions are always discrete — robotics and control use continuous action spaces (Module 8).
Forgetting terminal states: transitions from terminal s should stop the episode.

MDPs turn “learning from interaction” into precise math. Next we define returns and discounting — the scalar objective every RL algorithm optimizes.

Before this lesson

Previous lesson

What's next

Next lesson — Returns, discounting & episodes