Project: Gridworld Q-learning
Before we begin
Train tabular Q-learning on a small gridworld. You will see Bellman backups update a Q-table, then visualize the greedy policy as arrows — the same update rule DQN generalizes with neural nets in Module 4.
How this connects to Module 2
| Lesson | Where you use it |
|---|---|
| Dynamic programming | Optimal policy intuition to compare against |
| Monte Carlo | Full-episode returns vs bootstrapping |
| TD learning | One-step backup: Q(s,a) ← Q(s,a) + α(r + γ max Q(s',·) − Q(s,a)) |
| Q-learning & SARSA | Off-policy max backup (Q-learning) |
| On- vs off-policy | ε-greedy behavior, greedy evaluation |
What you will build
| Piece | Purpose |
|---|---|
GridWorld or CliffWalking-v0 | Small state space you can inspect |
| Tabular Q-learning | 2D array Q[state, action] |
policy_arrows.png | Greedy action per cell |
returns_curve.png | Episode return vs episode |
Folder layout:
gridworld-ql/
env.py # optional custom 5×5 grid
q_learning.py # agent + training loop
visualize.py # policy arrows + learning curve
outputs/Estimated time: 3–4 hours.
Before you start
- Finish the Module 2 quiz.
pip install gymnasium numpy matplotlib
Option A (recommended): Gymnasium CliffWalking-v0 — 48 states, 4 actions, classic textbook env.
Option B: Custom 5×5 grid with start, goal, and walls (implement reset/step returning discrete state ids 0…N-1).
Step 1 — Environment setup
import gymnasium as gym
import numpy as np
env = gym.make("CliffWalking-v0")
n_states = env.observation_space.n # 48
n_actions = env.action_space.n # 4Record a random policy baseline: average return over 100 episodes (expect large negative returns on Cliff Walking).
Step 2 — Tabular Q-learning
class QLearningAgent:
def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
self.q = np.zeros((n_states, n_actions))
self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
self.n_actions = n_actions
def act(self, state, explore=True):
if explore and np.random.random() < self.epsilon:
return env.action_space.sample()
return int(np.argmax(self.q[state]))
def learn(self, s, a, r, s_next, done):
target = r if done else r + self.gamma * np.max(self.q[s_next])
self.q[s, a] += self.lr * (target - self.q[s, a])Hyperparameters to keep in a config dict: lr, gamma, epsilon, num_episodes (try 500–2000).
Step 3 — Training loop
agent = QLearningAgent(n_states, n_actions)
returns = []
for ep in range(2000):
s, _ = env.reset(seed=ep)
total_r = 0.0
done = False
while not done:
a = agent.act(s)
s_next, r, term, trunc, _ = env.step(a)
done = term or trunc
agent.learn(s, a, r, s_next, done)
s = s_next
total_r += r
returns.append(total_r)Decay ε over time (e.g. linear 0.3 → 0.05) for faster convergence.
Step 4 — Visualize greedy policy
For CliffWalking, states are a 4×12 grid (row-major). Map each state to argmax action and draw arrows (↑↓←→).
policy = np.argmax(agent.q, axis=1) # shape (48,)
# map action 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT to unicode arrows for matplotlibSuccess criteria
| Criterion | Target |
|---|---|
| Learning curve trends upward | Required |
| Greedy policy reaches goal without falling off cliff | Required |
| Final 100-episode mean return > −50 on CliffWalking | Typical |
| Document one α/ε change and its effect | Required |
Extension ideas
- Implement SARSA and compare on the same env.
- Custom 5×5 windy grid with stochastic transitions.
- Heatmap of
max_a Q(s,a)per cell.
What's next
Return to the course curriculum and continue to the next module when your project runs end-to-end.