Dyna-Q & simulation
Before we begin
Dyna-Q is the cleanest bridge between tabular Q-learning and model-based RL. You learn a model from real experience, then perform additional Q-learning updates on simulated experience drawn from that model — all without extra environment steps. Sutton's Dyna architecture shows how planning can be a few lines of code on top of standard TD learning.
Dyna-Q — after each real step, update the model, then run k planning steps (simulated transitions + Q backups).
Direct RL — learning from real experience only.
Indirect RL — learning from simulated experience via the model.
What you will learn
- Implement the Dyna-Q loop: act, learn model, plan k steps.
- Build a tabular model as dictionaries for P(s′|s,a) and R(s,a).
- Tune n_planning_steps (k) and see speed vs accuracy trade-offs.
- Connect Dyna to experience replay in deep RL (same idea, neural Q).
- Diagnose when planning hurts because the model is wrong.
The Dyna-Q algorithm
On each real transition (s, a, r, s′):
- Direct RL: standard Q-learning backup on (s, a, r, s′).
- Model learning: record that (s, a) leads to (s′, r).
- Planning: repeat k times:
  - Sample a previously seen (ŝ, â) from the model table.
  - Sample (r̂, ŝ′) from the model for (ŝ, â).
  - Q-learning backup on (ŝ, â, r̂, ŝ′).
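# Tabular Dyna-Q (sketch)
import random
from collections import defaultdict

actions = [0, 1, 2, 3]   # assumption: four grid moves; replace with your action set
Q = defaultdict(float)   # (s, a) -> action value, defaults to 0.0
model_s = {}             # (s, a) -> s'
model_r = {}             # (s, a) -> r
seen_sa = []             # (s, a) pairs observed in real experience

def q_backup(s, a, r, s_next, alpha=0.1, gamma=0.95):
    # One-step Q-learning update toward r + gamma * max_a' Q(s', a').
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_step(s, a, r, s_next, n_plan=50):
    # Direct RL: learn from the real transition.
    q_backup(s, a, r, s_next)
    # Model learning: store the latest outcome (deterministic model).
    if (s, a) not in model_s:          # O(1) dict lookup instead of a list scan
        seen_sa.append((s, a))
    model_s[(s, a)] = s_next
    model_r[(s, a)] = r
    # Planning: n_plan simulated backups from uniformly sampled seen pairs.
    for _ in range(n_plan):
        sh, ah = random.choice(seen_sa)
        q_backup(sh, ah, model_r[(sh, ah)], model_s[(sh, ah)])

The agent thinks between actions by replaying past state–action pairs through the learned model.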
Worked example: Dyna maze
Consider a 9×9 grid with a single path to the goal. Q-learning alone may need hundreds of episodes to propagate reward backward, because each real step updates only one Q entry. Dyna-Q with k=50 runs 50 extra backups per real step, so after only a handful of visits the value of the goal has spread along the corridor stored in the model.
| Setting | Episodes to near-optimal Q | Environment steps |
|---|---|---|
| Q-learning, k=0 | ~800 | ~800 |
| Dyna-Q, k=50 | ~30 | ~30 |
| Dyna-Q, k=200 | ~15 | ~15 (diminishing returns) |
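The comparison in the table can be reproduced in miniature. The sketch below is not the 9×9 maze: it assumes a hypothetical 10-state corridor (names `N_STATES`, `env_step`, and `run_dyna_q` are illustrative) and simply counts real environment steps per episode for a given k. Exact numbers will depend on the seed and hyperparameters.

```python
import random
from collections import defaultdict

N_STATES = 10        # hypothetical corridor: states 0..9, goal at state 9
ACTIONS = (-1, +1)   # move left or right

def env_step(s, a):
    """Deterministic corridor dynamics; reward 1.0 only on reaching the goal."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def run_dyna_q(k, episodes=30, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    """Return (Q, steps_per_episode) for Dyna-Q with k planning steps per real step."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s_next)
    steps_per_episode = []

    def backup(s, a, r, s_next):
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    def choose(s):
        # Epsilon-greedy with random tie-breaking among equally good actions.
        if rng.random() < eps:
            return rng.choice(ACTIONS)
        best = max(Q[(s, b)] for b in ACTIONS)
        return rng.choice([b for b in ACTIONS if Q[(s, b)] == best])

    for _ in range(episodes):
        s, done, steps = 0, False, 0
        while not done:
            a = choose(s)
            s_next, r, done = env_step(s, a)
            backup(s, a, r, s_next)              # direct RL
            model[(s, a)] = (r, s_next)          # model learning
            seen = list(model)
            for _ in range(k):                   # planning on simulated transitions
                sh, ah = rng.choice(seen)
                rh, sph = model[(sh, ah)]
                backup(sh, ah, rh, sph)
            s, steps = s_next, steps + 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode

if __name__ == "__main__":
    for k in (0, 5, 50):
        _, steps = run_dyna_q(k)
        print(f"k={k:>2}: total real steps over 30 episodes = {sum(steps)}")
```

Plotting `steps_per_episode` for several values of k is one way to see where additional planning stops paying off.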
Checkpoint: Why sample (ŝ, â) uniformly from seen pairs instead of always using the latest (s, a)?
Answer
Uniform sampling spreads planning across every part of the state space the model knows. Planning only from the current (s, a) repeats the same local backup; uniform replay propagates value along all discovered corridors, similar to shuffled experience replay in DQN.
Model representation choices
| Model type | Storage | Stochastic? | Best for |
|---|---|---|---|
| Deterministic table | (s,a) → (s′, r) | No | Small discrete MDPs |
| Transition counts | P(s′ \| s,a) from counts | Yes | Small stochastic MDPs |
| Neural one-step | weights θ | Can be | Large / continuous (Dyna-style deep) |
For stochastic environments, sample s′ from learned probabilities instead of a single stored next state. Otherwise the planner invents false certainty.
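A count-based model along these lines might look like the sketch below. The class name `CountModel` and its methods are illustrative assumptions, not part of the lesson's code. It estimates P(s′|s,a) from visit counts and R(s,a) as the mean observed reward, then samples s′ in proportion to the counts.

```python
import random

class CountModel:
    """Tabular model estimating P(s' | s, a) and R(s, a) from visit counts."""

    def __init__(self):
        self.next_counts = {}   # (s, a) -> {s_next: count}
        self.reward_sum = {}    # (s, a) -> sum of observed rewards
        self.visits = {}        # (s, a) -> number of real visits

    def update(self, s, a, r, s_next):
        counts = self.next_counts.setdefault((s, a), {})
        counts[s_next] = counts.get(s_next, 0) + 1
        self.reward_sum[(s, a)] = self.reward_sum.get((s, a), 0.0) + r
        self.visits[(s, a)] = self.visits.get((s, a), 0) + 1

    def prob(self, s, a, s_next):
        """Empirical estimate of P(s_next | s, a); 0.0 for unseen pairs."""
        n = self.visits.get((s, a), 0)
        if n == 0:
            return 0.0
        return self.next_counts[(s, a)].get(s_next, 0) / n

    def sample(self, s, a, rng=random):
        """Return (mean reward, s') with s' drawn in proportion to its count."""
        counts = self.next_counts[(s, a)]
        s_next = rng.choices(list(counts), weights=list(counts.values()))[0]
        return self.reward_sum[(s, a)] / self.visits[(s, a)], s_next

# Hypothetical slippery step: 8 of 10 tries reach the goal, 2 fall off the cliff.
model = CountModel()
for _ in range(8):
    model.update("edge", "right", 1.0, "goal")
for _ in range(2):
    model.update("edge", "right", -1.0, "cliff")
r_hat, s_hat = model.sample("edge", "right")
```

In the planning loop, `model.sample(sh, ah)` would take the place of the two dictionary lookups used by a deterministic table.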
Dyna-Q+ (brief)
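In changing environments, old model entries go stale. Dyna-Q+ adds an exploration bonus κ√τ(s,a) to the reward used in planning backups, where τ(s,a) is the number of time steps since (s,a) was last tried in real experience. The bonus grows for long-untried pairs, which encourages the agent to revisit them and refresh the model.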
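The bonus is a small change to the planning reward. In the sketch below, `last_visit`, `record_real_step`, `planning_reward`, and the default value of `KAPPA` are illustrative assumptions; only the κ√τ form comes from the text above.

```python
import math

KAPPA = 1e-3      # assumption: small bonus weight, to be tuned per task

last_visit = {}   # (s, a) -> real time step at which the pair was last tried

def record_real_step(s, a, t):
    """Call once per real environment step to reset tau for (s, a)."""
    last_visit[(s, a)] = t

def planning_reward(s, a, r_model, t, kappa=KAPPA):
    """Model reward plus the Dyna-Q+ bonus kappa * sqrt(tau)."""
    tau = t - last_visit[(s, a)]
    return r_model + kappa * math.sqrt(tau)

# A pair last tried at t=0 earns a bonus of 0.01 * sqrt(100) when planning at t=100.
record_real_step("s0", "right", t=0)
boosted = planning_reward("s0", "right", r_model=0.0, t=100, kappa=0.01)
```

During planning, `planning_reward(...)` would replace the stored reward in the Q backup; the real-experience update is left unchanged.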
Relation to experience replay
| Dyna-Q (tabular) | DQN replay buffer |
|---|---|
| Model generates (s,a,r,s′) | Store real transitions |
| Sample seen (s,a), predict s′ | Sample random minibatch |
| Extra Q backups | Extra gradient steps |
Deep Dyna variants learn a neural model and push imagined transitions into the replay buffer (or a separate buffer). The intuition is identical: decouple data collection from learning to reuse experience more efficiently.
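For contrast with the model-based planning loop, here is a minimal replay-buffer sketch. It keeps a tabular Q instead of a neural network so that the parallel with Dyna-Q stays visible; `ReplayBuffer` and `replay_updates` are illustrative names, not an API from any library.

```python
import random
from collections import defaultdict, deque

class ReplayBuffer:
    """Fixed-capacity store of real transitions; the oldest entries are dropped first."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sample without replacement, capped at the current buffer size.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def replay_updates(Q, buffer, actions, batch_size=32, alpha=0.1, gamma=0.95, rng=random):
    """Extra Q backups on stored real transitions (the analogue of planning steps)."""
    for s, a, r, s_next in buffer.sample(batch_size, rng):
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
buf = ReplayBuffer(capacity=2)
buf.add("s0", "right", 1.0, "goal")
replay_updates(Q, buf, actions=["left", "right"], batch_size=1)
```

The key difference from Dyna-Q is the source of (r, s′): here it is read back exactly as observed, whereas Dyna-Q asks the learned model to generate it.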
When Dyna-Q fails
- Sparse exploration: model only knows a tiny region → planning reinforces local noise.
- High-dimensional states: tabular model impossible; need function approximation.
- Model error: imagined backups reinforce wrong values faster (more updates per step!).
- Very large k: wasted compute with little new information; can overfit model artifacts.
Start with k ≈ 10–50 on gridworlds; profile before scaling k.
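The model-error failure mode above can be made concrete with a few lines of arithmetic. The sketch assumes a hypothetical slippery step that pays +1 with probability 0.8 and -1 with probability 0.2, and a deterministic model that happened to store the -1 outcome last. The function name `plan_on_fixed_outcome` is illustrative.

```python
def plan_on_fixed_outcome(r_stored, n_plan, alpha=0.1):
    """Apply n_plan Q backups toward one stored terminal outcome (no bootstrap term)."""
    q = 0.0
    for _ in range(n_plan):
        q += alpha * (r_stored - q)
    return q

# Expected one-step value under the assumed true dynamics.
true_value = 0.8 * 1.0 + 0.2 * (-1.0)

# The deterministic model replays only the cliff outcome; larger k commits harder to it.
q_k5 = plan_on_fixed_outcome(-1.0, n_plan=5)
q_k50 = plan_on_fixed_outcome(-1.0, n_plan=50)

if __name__ == "__main__":
    print(f"true value {true_value:.2f}, k=5 gives {q_k5:.3f}, k=50 gives {q_k50:.3f}")
```

With more planning steps the estimate moves further from the true value of about 0.6 and closer to the single stored outcome, which is why a wrong model is more harmful at large k.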
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Planning before any model entries | Random simulated backups | Wait until seen_sa non-empty |
| k=0 and expecting Dyna gains | Same as Q-learning | Set k > 0 |
| Deterministic model in slippery env | Biased Q near cliffs | Use transition counts |
| Never visiting frontier states | Model stale at boundary | ε-greedy + Dyna-Q+ bonus |
| Confusing model updates with Q updates | Model entries overwritten with wrong values | Keep model_s / model_r dicts separate from Q |
Closing
Dyna-Q proves that planning is not magic — it is extra TD backups on simulated data. The same pattern scales to deep RL with neural models and replay. Your Module 7 project implements this on a gridworld; measure how k shifts the learning curve.