← Back to curriculum

Module 1 — RL foundations & MDPs

Markov decision processes

The MDP tuple, Markov property, transition dynamics, and mapping to Gymnasium spaces.

~60 min read + exercises

Markov decision processes

Before we begin

An MDP (Markov decision process) is the standard mathematical model for reinforcement learning. It packages states, actions, transition dynamics, rewards, and a discount factor into one object you can simulate, analyze, and solve — at least when the state space is small enough.

The Markov property is the key assumption: the future depends on the present state and action, not on the full history. If your state vector is missing something important, you do not have an MDP — you have a harder partially observable problem.


What you will learn

  • Write the MDP tuple (S, A, P, R, γ) and explain each part.
  • State the Markov property in words and notation.
  • Read transition probabilities P(s′ | s, a).
  • Connect MDPs to Gymnasium observation_space and action_space.
  • Build a tiny tabular MDP by hand.

The MDP tuple

SymbolNameMeaning
SState spaceAll possible states
AAction spaceActions the agent can take
PTransition kernelP(s′
RReward functionExpected immediate reward R(s, a) or R(s, a, s′)
γDiscount factorγ ∈ [0, 1] weights future reward

A policy π(a | s) assigns action probabilities per state. The agent’s job: find a policy that maximizes expected cumulative reward.


The Markov property

In words: given the current state sₜ and action aₜ, the distribution over the next state sₜ₊₁ does not depend on earlier states or actions.

Why it matters: algorithms store values per state (or per state–action pair). If the state is not “sufficient,” two histories that look like the same observation can require different optimal actions — tabular methods will fail.

Fixing non-Markov observations

ProblemSymptomFix
CartPole with position onlySame x, different velocityAdd velocity to state
Poker without card historySame public cards, different handsInclude private cards or belief state
Stock price onlyMomentum hiddenAdd moving averages as features

Checkpoint: You observe pixel frames from Atari. Is the single frame Markov?

Answer

Usually no — one frame lacks velocity. Standard fix: stack the last 4 frames so the state encodes motion. Deep Q-Networks do exactly this.


Worked example: 3-state line world

States: A (start), B (middle), C (goal). Actions: left, right. Deterministic transitions except from B: right goes to C with 0.8, slips back to A with 0.2.

| s | a | s′ | P(s′ | s, a) | R | |---|---|-----|---------------|---| | A | right | B | 1.0 | 0 | | B | right | C | 0.8 | 0 | | B | right | A | 0.2 | 0 | | C | any | C | 1.0 | +10 (terminal) |

Rewards: 0 per step, +10 entering C. This tiny MDP is enough to run value iteration later.

python
# Tabular transition dict: P[s][a] -> list of (prob, next_s, reward)
P = {
    "A": {"right": [(1.0, "B", 0.0)]},
    "B": {"right": [(0.8, "C", 0.0), (0.2, "A", 0.0)]},
    "C": {"left": [(1.0, "C", 10.0)], "right": [(1.0, "C", 10.0)]},
}

Stochastic vs deterministic environments

| Type | P(s′ | s, a) | Example | |------|---------------|---------| | Deterministic | One s′ with prob 1 | Gridworld without slip | | Stochastic | Multiple outcomes | Slippery ice, noisy actuators |

Expected next state is not enough for optimal control — you must plan over the distribution. Value functions average over that distribution.


Gymnasium spaces

python
import gymnasium as gym
 
env = gym.make("FrozenLake-v1")
print(env.observation_space)  # Discrete(16) for 4x4 grid
print(env.action_space)       # Discrete(4): left, down, right, up
Space typeMDP role
Discrete(n)Finite states or actions
Box(low, high, shape)Continuous vectors
Dict(...)Multi-part observations

The environment implements P and R; your agent never hard-codes them in real problems — it learns from samples.


Common mistakes

  • Using observation and state interchangeably in POMDP settings.
  • Defining rewards that omit step penalties when you care about short paths — agent may wander forever.
  • Assuming actions are always discrete — robotics and control use continuous action spaces (Module 8).
  • Forgetting terminal states: transitions from terminal s should stop the episode.

MDPs turn “learning from interaction” into precise math. Next we define returns and discounting — the scalar objective every RL algorithm optimizes.


Before this lesson


What's next