
Module 7 — Model-based RL

Dyna-Q & simulation

Integrate model learning with Q-learning; planning steps per real step.

~65 min read + exercises

Before we begin

Dyna-Q is the cleanest bridge between tabular Q-learning and model-based RL. You learn a model from real experience, then perform additional Q-learning updates on simulated experience drawn from that model — all without extra environment steps. Sutton's Dyna architecture shows how planning can be a few lines of code on top of standard TD learning.

Dyna-Q — after each real step, update the model, then run k planning steps (simulated transitions + Q backups).
Direct RL — learning from real experience only.
Indirect RL — learning from simulated experience via the model.


What you will learn

  • Implement the Dyna-Q loop: act, learn model, plan k steps.
  • Build a tabular model as dictionaries for P(s′|s,a) and R(s,a).
  • Tune n_planning_steps (k) and see speed vs accuracy trade-offs.
  • Connect Dyna to experience replay in deep RL (same idea, neural Q).
  • Diagnose when planning hurts because the model is wrong.

The Dyna-Q algorithm

On each real transition (s, a, r, s′):

  1. Direct RL: standard Q-learning backup on (s, a, r, s′).
  2. Model learning: record that (s, a) leads to (s′, r).
  3. Planning: repeat k times:
    • Sample a previously seen (ŝ, â) from the model table.
    • Sample (r̂, ŝ′) from the model for (ŝ, â).
    • Q-learning backup on (ŝ, â, r̂, ŝ′).
python
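# Tabular Dyna-Q (sketch)
import random
from collections import defaultdict

actions = [0, 1, 2, 3]     # assumed action set, e.g. up / right / down / left
Q = defaultdict(float)     # Q[(s, a)] -> value estimate, defaults to 0.0
model_s = {}               # (s, a) -> s'
model_r = {}               # (s, a) -> r
seen_sa = []               # (s, a) pairs observed in real experience

def q_backup(s, a, r, s_next, alpha=0.1, gamma=0.95):
    # One-step Q-learning update toward r + gamma * max_a' Q(s', a').
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_step(s, a, r, s_next, n_plan=50):
    q_backup(s, a, r, s_next)              # 1. direct RL on the real transition
    if (s, a) not in model_s:              # 2. model learning (latest outcome wins)
        seen_sa.append((s, a))
    model_s[(s, a)] = s_next
    model_r[(s, a)] = r
    for _ in range(n_plan):                # 3. planning on simulated transitions
        sh, ah = random.choice(seen_sa)
        rh = model_r[(sh, ah)]
        s_ph = model_s[(sh, ah)]
        q_backup(sh, ah, rh, s_ph)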

The agent thinks between actions by replaying past state–action pairs through the learned model.


Worked example: Dyna maze

Consider a 9×9 grid with a single path to the goal. Q-learning alone may need hundreds of episodes to propagate reward backward along that path. Dyna-Q with k=50 planning steps per real step replays the corridor stored in the model, so value spreads back from the goal after only a handful of visits.

Setting            Episodes to near-optimal Q   Environment steps
Q-learning, k=0    ~800                         ~800
Dyna-Q, k=50       ~30                          ~30
Dyna-Q, k=200      ~15                          ~15 (diminishing returns)
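
A comparison of this kind can be reproduced in miniature. The sketch below does not use the 9×9 maze from the table. It assumes a hypothetical 10-state corridor (`N_STATES`, `env_step`, and `run` are illustrative names), with ε-greedy action selection and random tie-breaking. It counts episodes and real environment steps until the greedy policy moves right in every state.

```python
import random
from collections import defaultdict

N_STATES = 10          # hypothetical corridor: states 0..9, goal at state 9
ACTIONS = (0, 1)       # 0 = left, 1 = right

def env_step(s, a):
    """Deterministic corridor dynamics; reward 1.0 only on reaching the goal."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def run(k, max_episodes=500, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    """Return (episodes, real steps) used until the greedy policy is optimal."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}                     # (s, a) -> (r, s_next), deterministic model
    seen = []                      # (s, a) pairs visited in real experience
    steps = 0

    def backup(s, a, r, s_next):
        best = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    def greedy_is_optimal():
        # "Right" must be strictly preferred in every non-goal state.
        return all(Q[(s, 1)] > Q[(s, 0)] for s in range(N_STATES - 1))

    for episode in range(1, max_episodes + 1):
        s, done = 0, False
        while not done:
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s_next, r, done = env_step(s, a)
            steps += 1
            backup(s, a, r, s_next)            # direct RL
            if (s, a) not in model:
                seen.append((s, a))
            model[(s, a)] = (r, s_next)        # model learning
            for _ in range(k):                 # planning
                sh, ah = rng.choice(seen)
                rh, sph = model[(sh, ah)]
                backup(sh, ah, rh, sph)
            s = s_next
        if greedy_is_optimal():
            return episode, steps
    return max_episodes, steps

if __name__ == "__main__":
    for k in (0, 10, 50):
        print(k, run(k))
```

Exact counts depend on the seed and hyperparameters, so treat the output as a qualitative check that larger k reduces the number of real episodes, not as a reproduction of the table.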

Checkpoint: Why sample (ŝ, â) uniformly from seen pairs instead of always using the latest (s, a)?

Answer

Uniform sampling spreads planning across every part of the state space the model knows. Planning only from the latest (s, a) would keep repeating the same local backup, whereas uniform replay propagates value along all discovered corridors, much like sampling shuffled minibatches from a DQN replay buffer.


Model representation choices

Model type            Storage                  Stochastic?   Best for
Deterministic table   (s,a) → (s′, r)          No            Small discrete MDPs
Transition counts     P(s′|s,a) from counts    Yes
Neural one-step       weights θ                Can be        Large / continuous (Dyna-style deep)

For stochastic environments, sample s′ from learned probabilities instead of a single stored next state. Otherwise the planner invents false certainty.
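
One way to realize the transition-count model is sketched below. The class name `CountModel` and its methods are hypothetical, and the sketch assumes the reward model R(s,a) is the running mean of observed rewards. That is one reasonable choice, not the only one.

```python
import random
from collections import defaultdict

class CountModel:
    """Tabular model estimating P(s'|s,a) and R(s,a) from visit counts."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': n}
        self.reward_sum = defaultdict(float)                      # (s,a) -> sum of r
        self.visits = defaultdict(int)                            # (s,a) -> N(s,a)

    def update(self, s, a, r, s_next):
        """Record one real transition."""
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def prob(self, s, a, s_next):
        """Empirical P(s'|s,a); only valid for (s,a) pairs that were visited."""
        return self.next_counts[(s, a)].get(s_next, 0) / self.visits[(s, a)]

    def sample(self, s, a, rng=random):
        """Return (mean reward, s') with s' drawn in proportion to its count."""
        counts = self.next_counts[(s, a)]
        s_next = rng.choices(list(counts), weights=list(counts.values()))[0]
        return self.reward_sum[(s, a)] / self.visits[(s, a)], s_next
```

In the planning loop, `model.sample(sh, ah)` would replace the two dictionary lookups of the deterministic model, so repeated planning backups average over the observed outcomes instead of committing to the most recent one.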

Dyna-Q+ (brief)

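In changing environments, old model entries go stale. Dyna-Q+ adds an exploration bonus κ√τ(s,a) to the modeled reward during planning, where τ(s,a) is the number of time steps since (s,a) was last tried in real experience and κ is a small constant. The bonus grows for pairs that have gone untried for a long time, which encourages the agent to revisit them and refresh the model.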
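
The bonus can be sketched as a small change to the planning backup. The version below assumes the bonus is added to the modeled reward in planning updates only, as in Sutton and Barto's description of Dyna-Q+. The names `last_visit`, `record_real_step`, and `plan_once` are hypothetical, and the value of κ is illustrative.

```python
import math
import random
from collections import defaultdict

ACTIONS = (0, 1)                 # hypothetical action set
Q = defaultdict(float)
model = {}                       # (s, a) -> (r, s_next)
last_visit = {}                  # (s, a) -> real time step of the latest visit

def bonus_reward(r, tau, kappa=0.1):
    """Planning reward r + kappa * sqrt(tau), tau = steps since last real visit."""
    return r + kappa * math.sqrt(tau)

def record_real_step(s, a, r, s_next, t):
    """Update the model and the visit clock after a real transition at time t."""
    model[(s, a)] = (r, s_next)
    last_visit[(s, a)] = t

def plan_once(t, alpha=0.1, gamma=0.95, kappa=0.1, rng=random):
    """One planning backup at current time t using the bonus-augmented reward."""
    s, a = rng.choice(list(model))
    r, s_next = model[(s, a)]
    r_plus = bonus_reward(r, t - last_visit[(s, a)], kappa)
    best = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r_plus + gamma * best - Q[(s, a)])
```

A pair last visited 100 steps ago with κ = 0.1 receives a bonus of 0.1 · √100 = 1.0, so even a zero-reward transition looks attractive to the planner until the agent tries it again.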


Relation to experience replay

Dyna-Q (tabular)                DQN replay buffer
Model generates (s,a,r,s′)      Store real transitions
Sample seen (s,a), predict s′   Sample random minibatch
Extra Q backups                 Extra gradient steps

Deep Dyna variants learn a neural model and push imagined transitions into the replay buffer (or a separate buffer). The intuition is identical: decouple data collection from learning to reuse experience more efficiently.
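
As a rough illustration of mixing real and imagined data, the sketch below uses a single bounded buffer and tags each transition with its origin. `ReplayBuffer`, `imagine`, and the `imagined` flag are hypothetical names. A plain dictionary stands in for the learned neural model, and the buffer is assumed to hold only (s, a) pairs that the model covers.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are evicted first."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, imagined=False):
        self.data.append((s, a, r, s_next, imagined))

    def sample(self, batch_size, rng=random):
        """Uniform minibatch without replacement (smaller if the buffer is short)."""
        return rng.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)

def imagine(buffer, model, n, rng=random):
    """Push n imagined transitions; `model` maps (s, a) -> (r_hat, s_next_hat)."""
    real = [t for t in buffer.data if not t[4]]
    if not real:
        return
    for s, a, _, _, _ in rng.choices(real, k=n):
        r_hat, s_hat = model[(s, a)]
        buffer.push(s, a, r_hat, s_hat, imagined=True)
```

With one shared buffer, imagined transitions can evict real ones once capacity is reached. Keeping a separate buffer for imagined data avoids that at the cost of a second sampling step.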


When Dyna-Q fails

  • Sparse exploration: model only knows a tiny region → planning reinforces local noise.
  • High-dimensional states: tabular model impossible; need function approximation.
  • Model error: imagined backups reinforce wrong values faster (more updates per step!).
  • Very large k: wasted compute with little new information; can overfit model artifacts.

Start with k ≈ 10–50 on gridworlds; profile before scaling k.


Common mistakes

Mistake                                  Symptom                                 Fix
Planning before any model entries        Random simulated backups                Wait until seen_sa is non-empty
k=0 and expecting Dyna gains             Same as Q-learning                      Set k > 0
Deterministic model in slippery env      Biased Q near cliffs                    Use transition counts
Never visiting frontier states           Model stale at boundary                 ε-greedy + Dyna-Q+ bonus
Confusing model updates with Q updates   Model entries overwritten incorrectly   Separate model_s / model_r dicts

Closing

Dyna-Q shows that planning is not magic: it is extra TD backups on simulated data. The same pattern scales to deep RL with neural models and replay. Your Module 7 project implements this on a gridworld; measure how k shifts the learning curve.

