Project: Dyna-Q gridworld
Before we begin
Implement Dyna-Q: learn a tabular model of the environment, then perform extra planning updates from simulated experience. Measure how the number of planning steps k affects sample efficiency on a gridworld.
How this connects to Module 7
| Lesson | Where you use it |
|---|---|
| Learned models | Tabular model[s,a] → (s', r) |
| Dyna-Q | Real step + k simulated Q-updates per step |
| MCTS | Contrasts planning at decision time vs background Dyna |
| World models | Same idea scales to neural models later |
What you will build
| Piece | Purpose |
|---|---|
| Tabular Q-learning + model | Standard online RL |
| Dyna-Q loop | k planning backups per env step |
| Comparison plot | Steps to reach goal: Dyna vs Q-only |
Use a 5×5 grid with fixed start/goal/walls (your SimpleGridWorld) or CliffWalking-v0.
Estimated time: 4–5 hours.
Before you start
- Finish the Module 7 quiz.
- Install the dependencies: pip install gymnasium numpy matplotlib
Step 1 — Model storage
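import numpy as np
from collections import defaultdict


class TabularModel:
    def __init__(self):
        self.transitions = {}  # (s, a) -> (s_next, r)

    def update(self, s, a, s_next, r):
        # Keep only the latest outcome observed for (s, a).
        self.transitions[(s, a)] = (s_next, r)

    def sample(self):
        # Draw a previously seen (s, a) pair uniformly at random.
        keys = list(self.transitions.keys())
        s, a = keys[np.random.randint(len(keys))]
        s_next, r = self.transitions[(s, a)]
        # done is always False here: the model does not store terminal flags.
        return s, a, r, s_next, False

Once a pair (s, a) has been seen, treat the model as deterministic (tabular): it always replays the most recently stored (s', r).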
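The model above always reports done=False for simulated transitions. One possible variation, sketched here on the assumption that you want terminal flags replayed during planning, stores done alongside (s_next, r). The class name TerminalAwareModel and the seed parameter are hypothetical and not part of the project code; the sketch uses Python's random module instead of NumPy only to stay dependency-free.

```python
import random


class TerminalAwareModel:
    """Deterministic tabular model: (s, a) -> (s_next, r, done).

    Hypothetical variant of TabularModel that also remembers the terminal flag.
    """

    def __init__(self, seed=None):
        self.transitions = {}
        self.rng = random.Random(seed)  # private RNG so runs are reproducible

    def update(self, s, a, s_next, r, done):
        # Overwrite: the latest observation for (s, a) is taken as the truth.
        self.transitions[(s, a)] = (s_next, r, done)

    def sample(self):
        # Pick a previously seen (s, a) pair uniformly at random.
        s, a = self.rng.choice(list(self.transitions))
        s_next, r, done = self.transitions[(s, a)]
        return s, a, r, s_next, done


model = TerminalAwareModel(seed=0)
model.update(s=3, a=1, s_next=4, r=-1.0, done=False)
model.update(s=4, a=1, s_next=5, r=0.0, done=True)
s, a, r, s_next, done = model.sample()
```

Building the key list on every call costs time proportional to the number of stored pairs, which is fine for a 5×5 grid but worth caching for larger state spaces.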
Step 2 — Dyna-Q inner loop
After each real (s, a, r, s', done) transition:
- Apply a Q-learning update on the real experience.
- Call model.update(s, a, s', r).
- Repeat k times:
  - Sample a simulated (ŝ, â, r̂, ŝ') from the model.
  - Apply a Q-learning update on the simulated experience.
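The steps above can be sketched as a single function. This is a minimal illustration, not the project's reference solution: the name dyna_q_step, the dict-based model, the stored done flag, and the list-based Q-table are all assumptions made so the example runs with the standard library alone. A NumPy Q-table would work the same way.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # assumption: four moves (up, right, down, left)


def q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # One-step Q-learning backup; terminal transitions do not bootstrap.
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def dyna_q_step(q, model, rng, s, a, r, s_next, done, k):
    """Process one real transition, then run k planning backups."""
    q_update(q, s, a, r, s_next, done)   # 1. learn from the real experience
    model[(s, a)] = (s_next, r, done)    # 2. record it in the model
    seen = list(model)                   # all (s, a) pairs observed so far
    for _ in range(k):                   # 3. k simulated backups
        ps, pa = rng.choice(seen)
        pn, pr, pdone = model[(ps, pa)]
        q_update(q, ps, pa, pr, pn, pdone)


q = defaultdict(lambda: [0.0] * N_ACTIONS)  # Q-table: state -> action values
model = {}                                  # (s, a) -> (s_next, r, done)
rng = random.Random(0)

# Two hand-written transitions on a 3-state chain: 0 -> 1 -> 2 (goal, reward 1).
dyna_q_step(q, model, rng, s=0, a=1, r=0.0, s_next=1, done=False, k=20)
dyna_q_step(q, model, rng, s=1, a=1, r=1.0, s_next=2, done=True, k=20)
```

With k=0, the value of (0, 1) can only change when the agent really revisits state 0. With planning, replaying the stored (0, 1) transition after the reward has been seen lets the reward propagate backwards without any new environment steps.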
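def q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.99, n_actions=4):
    # Terminal transitions use the reward alone; otherwise bootstrap from the best next action.
    target = r if done else r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])

Step 3 — Compare k ∈ {0, 10, 50}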
Run each value of k with identical seeds so that only k changes. Plot either the number of episodes until the first success or the area under the learning curve over the first 500 episodes.
| k | Meaning |
|---|---|
| 0 | Pure Q-learning (no planning) |
| 10 | Typical Dyna-Q |
| 50 | Heavy planning; can help when the model is accurate and hurt when it is wrong |
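A comparison harness along these lines might look like the sketch below. Everything in it is illustrative: the wall layout, the reward of 1 at the goal, the hyperparameters, and the function names are assumptions, and it stands in for your own SimpleGridWorld rather than reproducing it. It runs only 30 episodes per setting to stay fast; raise episodes to 500 for the real experiment. Action selection and planning use separate random generators so that changing k does not alter the exploration random stream.

```python
import random

SIZE, START, GOAL = 5, (0, 0), (4, 4)        # hypothetical 5x5 layout
WALLS = {(1, 1), (2, 1), (3, 3)}             # hypothetical wall cells
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left


def step(state, action):
    """Deterministic move; hitting a wall or the edge leaves the state unchanged."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if not (0 <= r < SIZE and 0 <= c < SIZE) or (r, c) in WALLS:
        r, c = state
    done = (r, c) == GOAL
    return (r, c), (1.0 if done else 0.0), done


def greedy(q, s, rng):
    # Break ties randomly so an all-zero Q-table does not always pick action 0.
    best = max(q[s])
    return rng.choice([a for a, v in enumerate(q[s]) if v == best])


def run(k, episodes=30, max_steps=200, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Return the number of real environment steps taken in each episode."""
    act_rng = random.Random(seed)        # drives epsilon-greedy action choice
    plan_rng = random.Random(seed + 1)   # drives sampling from the model
    q = {(r, c): [0.0] * 4 for r in range(SIZE) for c in range(SIZE)}
    model, lengths = {}, []
    for _ in range(episodes):
        s = START
        for t in range(1, max_steps + 1):
            a = act_rng.randrange(4) if act_rng.random() < eps else greedy(q, s, act_rng)
            s2, rew, done = step(s, a)
            model[(s, a)] = (s2, rew, done)
            keys = list(model)
            # The real transition first, then k pairs replayed from the model.
            batch = [(s, a)] + [plan_rng.choice(keys) for _ in range(k)]
            for ps, pa in batch:
                pn, pr, pd = model[(ps, pa)]
                target = pr if pd else pr + gamma * max(q[pn])
                q[ps][pa] += alpha * (target - q[ps][pa])
            s = s2
            if done:
                break
        lengths.append(t)   # episodes that never reach the goal count as max_steps
    return lengths


results = {k: run(k) for k in (0, 10, 50)}
for k, lengths in results.items():
    print(f"k={k:>2}: total real steps over {len(lengths)} episodes = {sum(lengths)}")
```

The printed totals are a discrete area under the steps-per-episode curve, so lower means fewer real steps were needed. Treat a single seed as anecdotal and average over several seeds before drawing conclusions about k=10 versus k=50.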
Success criteria
| Criterion | Target |
|---|---|
| Dyna-Q with k>0 reaches a good policy faster than k=0 | Required |
| README explains what happens when the model is wrong | Required |
| Policy visualization or path to goal | Recommended |
Extension ideas
- Prioritized sweeping: plan more from surprising transitions.
- Stochastic grid (slippery actions) — model mismatch discussion.
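For the stochastic-grid extension, a last-outcome model will replay whichever slip happened most recently. One possible alternative, sketched here as an assumption rather than a required design, counts every observed outcome per (s, a) and samples in proportion to those counts. The class name CountModel and its interface are hypothetical.

```python
import random
from collections import Counter, defaultdict


class CountModel:
    """Tabular model that samples outcomes in proportion to how often they were seen."""

    def __init__(self, seed=None):
        self.counts = defaultdict(Counter)  # (s, a) -> Counter of (s_next, r, done)
        self.rng = random.Random(seed)

    def update(self, s, a, s_next, r, done):
        self.counts[(s, a)][(s_next, r, done)] += 1

    def sample(self):
        # Uniform over seen (s, a) pairs, then frequency-weighted over outcomes.
        s, a = self.rng.choice(list(self.counts))
        outcomes = self.counts[(s, a)]
        s_next, r, done = self.rng.choices(
            list(outcomes), weights=list(outcomes.values()), k=1
        )[0]
        return s, a, r, s_next, done


model = CountModel(seed=0)
# Slippery action: from state 0, action 1 reached state 1 three times and slipped to state 5 once.
for s_next in (1, 1, 1, 5):
    model.update(0, 1, s_next, r=0.0, done=False)
draws = Counter(model.sample()[3] for _ in range(1000))
```

Comparing planning with this model against the deterministic one on a slippery grid gives concrete material for the README's model-mismatch discussion.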
What's next
Return to the course curriculum and continue to the next module when your project runs end-to-end.