Module 7 — Model-based RL

Project: Dyna-Q gridworld

Tabular Dyna-Q; compare sample efficiency with vs without planning steps.

~110 min read + exercises

Before we begin

Implement Dyna-Q: learn a tabular model of the environment, then perform extra planning updates on simulated experience drawn from that model. Measure how the number of planning steps k affects sample efficiency on a gridworld.


How this connects to Module 7

| Lesson | Where you use it |
| --- | --- |
| Learned models | Tabular model[s, a] → (s', r) |
| Dyna-Q | Real step + k simulated Q-updates per step |
| MCTS | Contrasts planning at decision time vs background Dyna |
| World models | Same idea scales to neural models later |

What you will build

| Piece | Purpose |
| --- | --- |
| Tabular Q-learning + model | Standard online RL |
| Dyna-Q loop | k planning backups per env step |
| Comparison plot | Steps to reach goal: Dyna vs Q-only |

Use a 5×5 grid with fixed start/goal/walls (your SimpleGridWorld) or CliffWalking-v0.
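If you do not already have a SimpleGridWorld, a minimal sketch might look like the one below. The layout (start, goal, wall cells), the reward of -1 per step, and the reset/step return values are assumptions made for illustration, not the course's reference implementation. Note that this sketch's step returns three values, whereas Gymnasium environments return five (observation, reward, terminated, truncated, info).

```python
class SimpleGridWorld:
    """Deterministic grid; a state is the integer row * size + col."""

    # Action index -> (row delta, col delta): up, right, down, left.
    MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]

    def __init__(self, size=5, start=(0, 0), goal=(4, 4),
                 walls=((1, 1), (2, 1), (3, 3))):  # assumed layout
        self.size = size
        self.start = start
        self.goal = goal
        self.walls = set(walls)
        self.n_states = size * size
        self.n_actions = len(self.MOVES)
        self.pos = start

    def _state(self):
        return self.pos[0] * self.size + self.pos[1]

    def reset(self):
        self.pos = self.start
        return self._state()

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        # Moves off the grid or into a wall leave the agent in place.
        inside = 0 <= r < self.size and 0 <= c < self.size
        if inside and (r, c) not in self.walls:
            self.pos = (r, c)
        done = self.pos == self.goal
        # Assumed reward scheme: -1 per step, 0 on reaching the goal.
        reward = 0.0 if done else -1.0
        return self._state(), reward, done
```

Because the start, goal, and walls are fixed, every run sees the same dynamics, which is what makes the deterministic tabular model in Step 1 a reasonable fit.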

Estimated time: 4–5 hours.


Before you start

  • Finish the Module 7 quiz.
  • pip install gymnasium numpy matplotlib

Step 1 — Model storage

python
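import numpy as np


class TabularModel:
    """Deterministic tabular model: keeps the last outcome seen for each (s, a)."""

    def __init__(self):
        self.transitions = {}  # (s, a) -> (s_next, r, done)

    def update(self, s, a, s_next, r, done=False):
        # Overwrite any earlier entry; the most recent observation wins.
        # Pass done=True for terminal transitions so that planning updates
        # do not bootstrap past the end of an episode.
        self.transitions[(s, a)] = (s_next, r, done)

    def sample(self):
        # Pick a previously observed (s, a) pair uniformly at random.
        keys = list(self.transitions)
        s, a = keys[np.random.randint(len(keys))]
        s_next, r, done = self.transitions[(s, a)]
        # Same order as a real transition: (s, a, r, s_next, done).
        return s, a, r, s_next, done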

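Once a pair (s, a) has been observed, the tabular model treats its outcome as deterministic: it stores a single (s', r) per pair, and a later observation overwrites the earlier one.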


Step 2 — Dyna-Q inner loop

After each real (s, a, r, s', done) transition:

  1. Q-learning update on real experience.
  2. model.update(s, a, s', r)
  3. Repeat k times:
    • Sample simulated (ŝ, â, r̂, ŝ') from model
    • Q-learning update on simulated experience
python
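def q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.99, n_actions=4):
    """One in-place Q-learning backup on q, a 2-D array indexed as q[state, action]."""
    # Terminal transitions do not bootstrap from the next state.
    target = r if done else r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])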
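The three steps above can be combined into one training loop. The following is a minimal sketch, not a reference solution: the Corridor environment is a hypothetical stand-in for your gridworld, the epsilon-greedy settings are assumptions, and the Q-table is a dict of lists so the block runs with the standard library alone. In the project you would typically swap in a NumPy Q-table and your own environment.

```python
import random


class Corridor:
    """Hypothetical toy environment: states 0..n-1, goal at the right end."""

    def __init__(self, n=8):
        self.n, self.n_actions = n, 2  # actions: 0 = left, 1 = right

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(0, self.s - 1) if a == 0 else self.s + 1
        done = self.s == self.n - 1
        return self.s, (0.0 if done else -1.0), done


def dyna_q(env, k, episodes=30, alpha=0.1, gamma=0.99, epsilon=0.1,
           max_steps=500, seed=0):
    rng = random.Random(seed)
    q = {}      # state -> list of action values
    model = {}  # (s, a) -> (r, s_next, done); the last observation wins

    def values(s):
        return q.setdefault(s, [0.0] * env.n_actions)

    def update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(values(s_next))
        values(s)[a] += alpha * (target - values(s)[a])

    steps_per_episode = []
    for _ in range(episodes):
        s = env.reset()
        for t in range(1, max_steps + 1):
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(env.n_actions)
            else:
                best = max(values(s))
                a = rng.choice([i for i, v in enumerate(values(s)) if v == best])
            s_next, r, done = env.step(a)
            update(s, a, r, s_next, done)      # 1. learn from real experience
            model[(s, a)] = (r, s_next, done)  # 2. update the model
            keys = list(model)
            for _ in range(k):                 # 3. k planning updates
                ps, pa = rng.choice(keys)
                pr, ps_next, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps_next, pdone)
            s = s_next
            if done:
                break
        steps_per_episode.append(t)
    return q, steps_per_episode
```

Calling dyna_q(Corridor(), k=0) gives plain Q-learning, since the planning loop body never runs. The returned steps_per_episode list is the learning curve to compare across values of k.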

Step 3 — Compare k ∈ {0, 10, 50}

Run every setting with the same seeds, changing only k. Plot either the number of episodes until the first success or the area under the learning curve over the first 500 episodes.

| k | Meaning |
| --- | --- |
| 0 | Pure Q-learning (no planning) |
| 10 | Typical Dyna-Q |
| 50 | Heavy planning; may help, or hurt if the model is wrong |
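Both comparison metrics can be computed from per-episode step counts. The sketch below assumes that "success" means an episode that finishes within a chosen step budget (success_steps) and that the learning curve is steps per episode, so a smaller area is better. Both definitions, and all function names, are assumptions; adapt them to whatever curve you actually plot.

```python
def episodes_to_first_success(steps_per_episode, success_steps):
    """Return the 1-based index of the first episode that finished within
    success_steps environment steps, or None if no episode did."""
    for i, steps in enumerate(steps_per_episode, start=1):
        if steps <= success_steps:
            return i
    return None


def area_under_curve(steps_per_episode, first_n=500):
    """Sum of steps over the first first_n episodes (lower is better)."""
    return sum(steps_per_episode[:first_n])


def summarize(curves_by_k, success_steps, first_n=500):
    """curves_by_k maps k -> list of per-seed curves (lists of step counts)."""
    summary = {}
    for k, curves in curves_by_k.items():
        firsts = [episodes_to_first_success(c, success_steps) for c in curves]
        aucs = [area_under_curve(c, first_n) for c in curves]
        summary[k] = {
            "first_success": firsts,            # one entry per seed
            "mean_auc": sum(aucs) / len(aucs),  # averaged over seeds
        }
    return summary
```

Keeping one curve per seed, rather than averaging first, lets you report the spread across seeds alongside the mean.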

Success criteria

| Criterion | Target |
| --- | --- |
| Dyna-Q with k > 0 reaches a good policy faster than k = 0 | Required |
| README explains what happens when the model is wrong | Required |
| Policy visualization or path to goal | Recommended |

Extension ideas

  • Prioritized sweeping: plan more from surprising transitions.
  • Stochastic grid (slippery actions): discuss the resulting mismatch between the environment and the deterministic model.
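For the prioritized sweeping extension, one possible shape of the planning phase is sketched below. It is a simplified variant under stated assumptions: priorities are absolute TD errors, the queue may hold duplicate entries, states are integers, and q already has an entry for every state the model can reach. The data layout (dicts for q, model, and predecessors) and the threshold theta are illustrative choices, not part of the project specification.

```python
import heapq


def prioritized_planning(q, model, predecessors, pqueue, k,
                         alpha=0.1, gamma=0.99, theta=1e-4):
    """Run up to k planning updates, largest |TD error| first.

    q:            dict state -> list of action values
    model:        dict (s, a) -> (r, s_next, done)
    predecessors: dict state -> set of (s, a) pairs observed to lead there
    pqueue:       heap of (-priority, s, a); heapq is a min-heap, so
                  priorities are stored negated
    """
    def td_error(s, a):
        r, s_next, done = model[(s, a)]
        target = r if done else r + gamma * max(q[s_next])
        return target - q[s][a]

    for _ in range(k):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)
        q[s][a] += alpha * td_error(s, a)
        # The value of s changed, so transitions leading into s
        # may now be surprising; queue them if the error is large enough.
        for ps, pa in predecessors.get(s, ()):
            p = abs(td_error(ps, pa))
            if p > theta:
                heapq.heappush(pqueue, (-p, ps, pa))
```

After each real step you would compute the TD error of the observed (s, a), push it onto pqueue if it exceeds theta, record (s, a) in predecessors[s_next], and then call this function in place of the uniform-sampling planning loop.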

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.