Project: Dyna-Q gridworld

Before we begin

Implement Dyna-Q: learn a tabular model of the environment, then use it to perform extra planning updates from simulated experience. Measure how the number of planning steps k per real step affects sample efficiency on a gridworld.

How this connects to Module 7

Lesson	Where you use it
Learned models	Tabular `model[s,a] → (s', r)`
Dyna-Q	Real step + k simulated Q-updates per step
MCTS	Contrasts planning at decision time vs background Dyna
World models	Same idea scales to neural models later

What you will build

Piece	Purpose
Tabular Q-learning + model	Standard online RL
Dyna-Q loop	k planning backups per env step
Comparison plot	Steps to reach goal: Dyna vs Q-only

Use a 5×5 grid with fixed start/goal/walls (your SimpleGridWorld) or CliffWalking-v0.
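
If you do not already have a SimpleGridWorld, a minimal sketch might look like the one below. The layout (start in the top-left corner, goal in the bottom-right, three wall cells), the reward of -1 per step and 0 on reaching the goal, and the reset/step signatures are all assumptions made for illustration, not the course's reference environment. It deliberately returns three values from step, which is simpler than Gymnasium's five-value step.

```python
class SimpleGridWorld:
    """Hypothetical 5x5 deterministic grid; states are integers row * size + col."""

    # Action index -> (row delta, col delta): up, right, down, left.
    MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

    def __init__(self, size=5, start=(0, 0), goal=(4, 4),
                 walls=((1, 1), (2, 3), (3, 1))):
        self.size = size
        self.start = start
        self.goal = goal
        self.walls = set(walls)
        self.n_states = size * size
        self.n_actions = 4
        self.pos = start

    def _to_state(self, pos):
        return pos[0] * self.size + pos[1]

    def reset(self):
        self.pos = self.start
        return self._to_state(self.pos)

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        in_bounds = 0 <= r < self.size and 0 <= c < self.size
        # Moves into a wall or off the grid leave the agent where it is.
        if in_bounds and (r, c) not in self.walls:
            self.pos = (r, c)
        done = self.pos == self.goal
        reward = 0.0 if done else -1.0
        return self._to_state(self.pos), reward, done
```

Integer states keep the Q-table a plain 2-D array of shape (n_states, n_actions), which is what the update in Step 2 expects.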

Estimated time: 4–5 hours.

Before you start

Finish the Module 7 quiz.
pip install gymnasium numpy matplotlib

Step 1 — Model storage

python

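import numpy as np


class TabularModel:
    """Deterministic tabular model: remembers the last outcome seen for each (s, a)."""

    def __init__(self):
        self.transitions = {}  # (s, a) -> (s_next, r, done)

    def update(self, s, a, s_next, r, done=False):
        # Overwrite any earlier entry: the latest observation wins.
        # done is optional; pass it so planning does not bootstrap past terminal states.
        self.transitions[(s, a)] = (s_next, r, done)

    def sample(self):
        # Pick a previously observed (s, a) pair uniformly at random.
        keys = list(self.transitions)
        s, a = keys[np.random.randint(len(keys))]
        s_next, r, done = self.transitions[(s, a)]
        return s, a, r, s_next, done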

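Once a pair (s, a) has been seen, the tabular model treats it as deterministic: it stores only the most recent outcome for that pair and replays exactly that outcome during planning.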

Step 2 — Dyna-Q inner loop

After each real (s, a, r, s', done) transition:

Q-learning update on real experience.
model.update(s, a, s', r)
Repeat k times:
- Sample simulated (ŝ, â, r̂, ŝ') from model
- Q-learning update on simulated experience
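
The three steps above can be sketched as a single episode function. This is one possible arrangement, not a reference solution: the epsilon-greedy policy, the plain dict used as the model, the `env_step(s, a)` callable, and the toy `corridor_step` environment are all assumptions introduced so the block runs on its own. The Q-learning update is written inline for the same reason.

```python
import numpy as np


def dyna_q_episode(env_step, start_state, q, model, k, rng,
                   alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=1000):
    """Run one episode; return the number of real environment steps taken."""
    s = start_state
    for t in range(1, max_steps + 1):
        # Epsilon-greedy action selection with random tie-breaking.
        if rng.random() < epsilon:
            a = int(rng.integers(q.shape[1]))
        else:
            best = np.flatnonzero(q[s] == q[s].max())
            a = int(rng.choice(best))
        s_next, r, done = env_step(s, a)

        # 1. Q-learning update on the real transition.
        target = r if done else r + gamma * q[s_next].max()
        q[s, a] += alpha * (target - q[s, a])
        # 2. Record the transition in the model (latest observation wins).
        model[(s, a)] = (s_next, r, done)
        # 3. k planning updates from transitions replayed out of the model.
        keys = list(model)
        for _ in range(k):
            ps, pa = keys[rng.integers(len(keys))]
            ps_next, pr, pdone = model[(ps, pa)]
            p_target = pr if pdone else pr + gamma * q[ps_next].max()
            q[ps, pa] += alpha * (p_target - q[ps, pa])

        if done:
            return t
        s = s_next
    return max_steps


def corridor_step(s, a, length=6):
    """Hypothetical 1-D corridor: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, length - 1) if a == 1 else max(s - 1, 0)
    done = s_next == length - 1
    return s_next, (1.0 if done else 0.0), done


rng = np.random.default_rng(0)
q = np.zeros((6, 2))
model = {}
steps = [dyna_q_episode(corridor_step, 0, q, model, k=10, rng=rng)
         for _ in range(30)]
```

Setting k=0 skips the planning loop entirely, so the same function doubles as the pure Q-learning baseline for the comparison in Step 3.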

python

import numpy as np

def q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # q is a 2-D array indexed as q[state, action]; terminal targets do not bootstrap.
    target = r if done else r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])
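
To check an implementation by hand, it can help to trace the update on a toy table. The sketch below applies the same rule inline with alpha = 0.1 and gamma = 0.99; the table size and the starting values are made up for the illustration.

```python
import numpy as np

alpha, gamma = 0.1, 0.99
q = np.zeros((3, 2))      # 3 states, 2 actions (toy sizes, chosen for illustration)
q[1] = [0.5, 2.0]         # pretend state 1 already has some value estimates

# Non-terminal transition: s=0, a=1, r=-1, s_next=1.
target = -1.0 + gamma * np.max(q[1])      # -1 + 0.99 * 2.0 = 0.98
q[0, 1] += alpha * (target - q[0, 1])     # 0 + 0.1 * (0.98 - 0) = 0.098

# Terminal transition: s=1, a=0, r=10, done=True, so there is no bootstrap term.
terminal_target = 10.0
q[1, 0] += alpha * (terminal_target - q[1, 0])   # 0.5 + 0.1 * 9.5 = 1.45
```

The terminal case is the one to watch: bootstrapping from the next state when the episode has ended would leak value across episode boundaries.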

Step 3 — Compare k ∈ {0, 10, 50}

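Run every value of k with the same set of seeds, so that k is the only thing that changes between runs. For each k, plot either the number of episodes until the first success or the area under the learning curve over the first 500 episodes.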

k	Meaning
0	Pure Q-learning (no planning)
10	Typical Dyna-Q
50	Heavy planning: speeds up learning when the model is accurate, can hurt when the model is wrong
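
Both metrics can be computed from a list of per-episode step counts. The helpers below are a sketch under two assumptions that the project text leaves open: "success" is taken to mean an episode that finishes within a chosen step threshold, and the learning curve is steps per episode, so a smaller area means faster learning. The function names and the example numbers are hypothetical.

```python
import numpy as np


def episodes_to_first_success(steps_per_episode, threshold):
    """1-based index of the first episode finished in <= threshold steps, or None."""
    for i, n in enumerate(steps_per_episode, start=1):
        if n <= threshold:
            return i
    return None


def area_under_curve(steps_per_episode, n_episodes=500):
    """Sum of steps over the first n_episodes; lower means faster learning."""
    return int(np.sum(steps_per_episode[:n_episodes]))


# Made-up step counts purely to show the call pattern; these are not results.
example = {0: [40, 35, 30, 12, 9], 10: [38, 11, 9, 8, 8]}
summary = {k: (episodes_to_first_success(v, threshold=12), area_under_curve(v))
           for k, v in example.items()}
```

Averaging each metric over several seeds before plotting gives a fairer comparison than a single run per k.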

Success criteria

Criterion	Target
Dyna-Q with k>0 reaches a good policy faster than k=0	Required
README explains what happens when the model is wrong	Required
Policy visualization or path to goal	Recommended
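
For the recommended policy visualization, a text rendering of the greedy action in each cell is often enough. The sketch below assumes states are numbered row by row and that actions are ordered up, right, down, left; adjust ARROWS if your environment orders them differently. The function name and the demo Q-table are hypothetical.

```python
import numpy as np

ARROWS = np.array(["^", ">", "v", "<"])  # assumed action order: up, right, down, left


def render_policy(q, size=5, goal=None, walls=()):
    """Return the greedy policy as a list of strings, one per grid row."""
    greedy = ARROWS[np.argmax(q, axis=1)].reshape(size, size)
    rows = []
    for r in range(size):
        cells = []
        for c in range(size):
            if (r, c) == goal:
                cells.append("G")
            elif (r, c) in walls:
                cells.append("#")
            else:
                cells.append(str(greedy[r, c]))
        rows.append(" ".join(cells))
    return rows


q_demo = np.zeros((25, 4))
q_demo[:, 1] = 1.0          # toy Q-table that always prefers "right"
grid = render_policy(q_demo, goal=(4, 4), walls={(1, 1)})
print("\n".join(grid))
```

Cells the agent never visited show the arrow for action 0, because argmax breaks ties toward the first index; keep that in mind when reading the picture.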

Extension ideas

Prioritized sweeping: plan more from surprising transitions.
Stochastic grid (slippery actions) — model mismatch discussion.
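
For the stochastic-grid extension, one way to see the mismatch is to compare the last-outcome model from Step 1 with a model that counts outcomes. The class below is a hypothetical sketch, not part of the required project: it records how often each outcome followed (s, a) and samples outcomes in proportion to those counts, so a rare slip no longer overwrites the usual result.

```python
from collections import defaultdict

import numpy as np


class CountModel:
    """Hypothetical model for stochastic grids: counts every observed outcome."""

    def __init__(self, seed=0):
        # (s, a) -> {(s_next, r, done): number of times observed}
        self.counts = defaultdict(lambda: defaultdict(int))
        self.rng = np.random.default_rng(seed)

    def update(self, s, a, s_next, r, done=False):
        self.counts[(s, a)][(s_next, r, done)] += 1

    def sample(self):
        # Uniformly pick a seen (s, a), then an outcome in proportion to its count.
        keys = list(self.counts)
        s, a = keys[self.rng.integers(len(keys))]
        outcomes = list(self.counts[(s, a)])
        weights = np.array([self.counts[(s, a)][o] for o in outcomes], dtype=float)
        idx = self.rng.choice(len(outcomes), p=weights / weights.sum())
        s_next, r, done = outcomes[idx]
        return s, a, r, s_next, done


model = CountModel()
for _ in range(8):
    model.update(0, 1, 1, -1.0)   # intended move happened 8 times
for _ in range(2):
    model.update(0, 1, 5, -1.0)   # slipped sideways 2 times
draws = [model.sample()[3] for _ in range(1000)]
```

With the Step 1 model, the same ten observations would leave only the final slip in the table, and every planning update for that pair would replay it.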

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.