Dyna-Q & simulation

Before we begin

Dyna-Q is the cleanest bridge between tabular Q-learning and model-based RL. You learn a model from real experience, then perform additional Q-learning updates on simulated experience drawn from that model — all without extra environment steps. Sutton's Dyna architecture shows how planning can be a few lines of code on top of standard TD learning.

Dyna-Q — after each real step, update the model, then run k planning steps (simulated transitions + Q backups).
Direct RL — learning from real experience only.
Indirect RL — learning from simulated experience via the model.

What you will learn

Implement the Dyna-Q loop: act, learn model, plan k steps.
Build a tabular model as dictionaries for P(s′|s,a) and R(s,a).
Tune n_planning_steps (k) and see speed vs accuracy trade-offs.
Connect Dyna to experience replay in deep RL (same idea, neural Q).
Diagnose when planning hurts because the model is wrong.

The Dyna-Q algorithm

On each real transition (s, a, r, s′):

Direct RL: standard Q-learning backup on (s, a, r, s′).
Model learning: record that (s, a) leads to (s′, r).
Planning: repeat k times:
- Sample a previously seen (ŝ, â) from the model table.
- Sample (r̂, ŝ′) from the model for (ŝ, â).
- Q-learning backup on (ŝ, â, r̂, ŝ′).

python

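# Tabular Dyna-Q (sketch)
import random
from collections import defaultdict

actions = [0, 1, 2, 3]   # assumed action set (e.g. four grid moves); adapt to your environment
Q = defaultdict(float)   # (s,a) -> action value, 0.0 for unseen pairs
model_s = {}   # (s,a) -> s'
model_r = {}   # (s,a) -> r
seen_sa = []   # (s,a) pairs visited in real experience

def q_backup(s, a, r, s_next, alpha=0.1, gamma=0.95):
    # One-step Q-learning backup toward r + gamma * max_a' Q(s', a').
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_step(s, a, r, s_next, n_plan=50):
    q_backup(s, a, r, s_next)          # direct RL on the real transition
    if (s, a) not in model_s:          # dict lookup avoids scanning the list
        seen_sa.append((s, a))
    model_s[(s, a)] = s_next           # model learning: the latest outcome wins
    model_r[(s, a)] = r
    for _ in range(n_plan):            # planning: n_plan simulated backups
        sh, ah = random.choice(seen_sa)
        rh = model_r[(sh, ah)]
        s_ph = model_s[(sh, ah)]
        q_backup(sh, ah, rh, s_ph)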

The agent thinks between actions by replaying past state–action pairs through the learned model.

Worked example: Dyna maze

Consider a 9×9 grid with a single path to the goal. Q-learning alone moves the reward back roughly one state per visit, so it can need hundreds of episodes to propagate it all the way to the start. Dyna-Q with k=50 planning steps after each real step replays the modelled corridor many times, so the Q-values along it fill in after only a handful of visits.

Setting	Episodes to near-optimal Q	Environment steps
Q-learning, k=0	~800	~800
Dyna-Q, k=50	~30	~30
Dyna-Q, k=200	~15	~15 (diminishing returns)
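
The effect of k can be reproduced on a much smaller problem. The sketch below is an illustration under assumptions, not the maze from the table: a hypothetical 6-state corridor, a fixed left-to-right walk replayed twice, and a helper named run that exists only for this example. Both agents see exactly the same real transitions; only the number of planning steps differs.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")   # hypothetical action names for a 1-D corridor

def run(n_plan, n_passes=2, length=6, alpha=0.1, gamma=0.95, seed=0):
    """Replay the same left-to-right walk n_passes times and return the Q-table."""
    rng = random.Random(seed)          # only planning uses randomness
    Q = defaultdict(float)
    model = {}                         # (s, a) -> (r, s_next), deterministic
    seen = []                          # model keys, kept as a list for rng.choice

    def backup(s, a, r, s_next):
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    goal = length - 1
    for _ in range(n_passes):
        for s in range(goal):          # real transitions s -> s + 1
            a, s_next = "right", s + 1
            r = 1.0 if s_next == goal else 0.0
            backup(s, a, r, s_next)    # direct RL
            if (s, a) not in model:
                seen.append((s, a))
            model[(s, a)] = (r, s_next)
            for _ in range(n_plan):    # planning on simulated transitions
                sh, ah = rng.choice(seen)
                rh, sph = model[(sh, ah)]
                backup(sh, ah, rh, sph)
    return Q

q_plain = run(n_plan=0)
q_dyna = run(n_plan=50)
print("Q(start, right) with k=0: ", q_plain[(0, "right")])
print("Q(start, right) with k=50:", q_dyna[(0, "right")])
```

With k=0 the start-state value is still exactly 0 after two passes, because each pass moves the reward back by only one state. With k=50 the planning backups carry it to the start without any additional environment steps.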

Checkpoint: Why sample (ŝ, â) uniformly from seen pairs instead of always using the latest (s, a)?

Answer

Uniform sampling spreads planning across the state space the model knows. Always planning from the current state only reinforces local backups; uniform replay propagates value along all discovered corridors, similar to shuffled experience replay in DQN.

Model representation choices

Model type	Storage	Stochastic?	Best for
Deterministic table	(s,a) → (s′, r)	No	Small discrete MDPs
Transition counts	P(s′|s,a) from counts	Yes	Stochastic discrete MDPs
Neural one-step	weights θ	Can be	Large / continuous (Dyna-style deep)

For stochastic environments, sample s′ from learned probabilities instead of a single stored next state. Otherwise the planner invents false certainty.
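
A count-based model might look like the sketch below. The class name CountModel and its methods are hypothetical, chosen for this example; the idea is only that each (s, a) keeps a tally of observed (r, s′) outcomes and planning samples from that tally.

```python
import random
from collections import defaultdict

class CountModel:
    """Tabular model that stores outcome counts for each (s, a)."""

    def __init__(self, seed=None):
        # (s, a) -> {(r, s_next): number of times observed}
        self.counts = defaultdict(lambda: defaultdict(int))
        self.rng = random.Random(seed)

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][(r, s_next)] += 1

    def prob(self, s, a, s_next):
        """Empirical estimate of P(s' | s, a); 0.0 for unseen (s, a)."""
        outcomes = self.counts.get((s, a), {})
        total = sum(outcomes.values())
        hits = sum(n for (_, sn), n in outcomes.items() if sn == s_next)
        return hits / total if total else 0.0

    def sample(self, s, a):
        """Draw (r, s') in proportion to its count; call only for seen (s, a)."""
        outcomes = self.counts[(s, a)]
        pairs = list(outcomes.keys())
        weights = list(outcomes.values())
        return self.rng.choices(pairs, weights=weights, k=1)[0]

model = CountModel(seed=0)
for _ in range(3):
    model.update("s0", "right", 0.0, "s1")
model.update("s0", "right", -1.0, "cliff")   # one slip out of four tries
print(model.prob("s0", "right", "s1"))       # 3 of 4 observations: 0.75
print(model.sample("s0", "right"))
```

In the planning loop, model.sample(ŝ, â) would replace the two dictionary lookups of the deterministic table.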

Dyna-Q+ (brief)

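In changing environments, old model entries go stale. Dyna-Q+ adds an exploration bonus κ√τ(s,a) to the modelled reward during planning, where τ(s,a) is the number of time steps since (s,a) was last tried in real experience. Long-untried pairs look more rewarding, which encourages the agent to revisit them and refresh the model.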
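
As a rough sketch, the bonus only changes the reward fed into each planning backup. The function name bonus_reward, the last_visit dictionary, and the value of κ below are assumptions for illustration.

```python
import math

def bonus_reward(r_model, t_now, t_last, kappa=0.001):
    """Reward used in a planning backup: r + kappa * sqrt(tau)."""
    tau = t_now - t_last          # real time steps since (s, a) was last tried
    return r_model + kappa * math.sqrt(tau)

# Hypothetical bookkeeping: (s, a) -> time step of the last real visit.
last_visit = {("s0", "right"): 100}

r_plus = bonus_reward(0.0, t_now=500, t_last=last_visit[("s0", "right")], kappa=0.01)
print(r_plus)   # tau = 400, sqrt(tau) = 20, so the bonus is about 0.2
```

The agent would update last_visit on every real step and pass r_plus instead of the stored reward to the Q backup during planning.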

Relation to experience replay

Dyna-Q (tabular)	DQN replay buffer
Model generates (s,a,r,s′)	Store real transitions
Sample seen (s,a), predict s′	Sample random minibatch
Extra Q backups	Extra gradient steps

Deep Dyna variants learn a neural model and push imagined transitions into the replay buffer (or a separate buffer). The intuition is identical: decouple data collection from learning to reuse experience more efficiently.
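
To make the parallel concrete, here is a minimal replay-style update. It is a simplification under assumptions: the Q-function stays tabular rather than neural, and the names buffer and replay_updates are hypothetical. Compared with the Dyna-Q planning loop, the only difference is where the transitions come from: stored real tuples instead of model predictions.

```python
import random
from collections import defaultdict, deque

ACTIONS = (0, 1)                       # hypothetical action set
buffer = deque(maxlen=10_000)          # stores real (s, a, r, s_next) tuples
Q = defaultdict(float)

def replay_updates(batch_size=4, alpha=0.1, gamma=0.95, rng=random):
    """Apply Q-learning backups to a random minibatch of stored transitions."""
    batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
    for s, a, r, s_next in batch:
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return len(batch)

buffer.append(("s0", 1, 0.0, "s1"))
buffer.append(("s1", 1, 1.0, "goal"))
n_updated = replay_updates()
print(n_updated, Q[("s1", 1)])
```

A deep variant would replace the table update with a gradient step on a minibatch loss, and a deep Dyna variant would also append model-generated tuples to the buffer.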

When Dyna-Q fails

Sparse exploration: model only knows a tiny region → planning reinforces local noise.
High-dimensional states: tabular model impossible; need function approximation.
Model error: imagined backups reinforce wrong values faster (more updates per step!).
Very large k: wasted compute with little new information; can overfit model artifacts.

Start with k ≈ 10–50 on gridworlds; profile before scaling k.
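
The model-error failure can be shown in a few lines. The setup is hypothetical: a terminal transition that pays 1 or 0 with equal probability in the real environment, while a deterministic model has stored only the last outcome, r = 1. The function planned_value is a name made up for this sketch.

```python
def planned_value(model_reward, n_backups, alpha=0.1):
    """Q(s, a) after n planning backups on a terminal transition with a fixed model reward."""
    q = 0.0
    for _ in range(n_backups):
        q += alpha * (model_reward - q)   # terminal next state, so no bootstrap term
    return q

true_value = 0.5 * 1.0 + 0.5 * 0.0      # real environment: reward 1 or 0, equally likely
stale_model_reward = 1.0                # the deterministic model kept only the last outcome

for k in (1, 10, 50):
    print(k, round(planned_value(stale_model_reward, k), 3))
```

Each backup closes a fraction alpha of the gap to the model's reward, so the estimate after k backups is 1 − 0.9^k. It heads toward 1.0 as k grows, well above the true value of 0.5, and a larger k only gets there faster.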

Common mistakes

Mistake	Symptom	Fix
Planning before any model entries	random.choice raises IndexError on an empty seen_sa	Record the real transition before planning
k=0 and expecting Dyna gains	Same as Q-learning	Set k > 0
Deterministic model in slippery env	Biased Q near cliffs	Use transition counts
Never visiting frontier states	Model stale at boundary	ε-greedy + Dyna-Q+ bonus
Confusing model updates with Q updates	Model entries overwritten with wrong values	Keep Q separate from the model_s / model_r dicts

Closing

Dyna-Q shows that planning is not magic: it is extra TD backups on simulated data. The same pattern scales to deep RL with neural models and replay. Your Module 7 project implements this on a gridworld; measure how k shifts the learning curve.

What's next

Next lesson — Monte Carlo tree search