Project: Gridworld Q-learning

Before we begin

Train tabular Q-learning on a small gridworld. You will see Bellman backups update a Q-table, then visualize the greedy policy as arrows — the same update rule DQN generalizes with neural nets in Module 4.

How this connects to Module 2

Lesson	Where you use it
Dynamic programming	Optimal policy intuition to compare against
Monte Carlo	Full-episode returns vs bootstrapping
TD learning	One-step backup: Q(s,a) ← Q(s,a) + α(r + γ max Q(s',·) − Q(s,a))
Q-learning & SARSA	Off-policy max backup (Q-learning)
On- vs off-policy	ε-greedy behavior, greedy evaluation

What you will build

Piece	Purpose
`GridWorld` or `CliffWalking-v0`	Small state space you can inspect
Tabular Q-learning	2D array `Q[state, action]`
`policy_arrows.png`	Greedy action per cell
`returns_curve.png`	Episode return vs episode

Folder layout:

text

gridworld-ql/
  env.py             # optional custom 5×5 grid
  q_learning.py      # agent + training loop
  visualize.py       # policy arrows + learning curve
  outputs/

Estimated time: 3–4 hours.

Before you start

Finish the Module 2 quiz.
pip install gymnasium numpy matplotlib

Option A (recommended): Gymnasium CliffWalking-v0 — 48 states, 4 actions, classic textbook env.

Option B: Custom 5×5 grid with start, goal, and walls (implement reset/step returning discrete state ids 0…N-1).

Step 1 — Environment setup

python

import gymnasium as gym
import numpy as np
 
env = gym.make("CliffWalking-v0")
n_states = env.observation_space.n   # 48
n_actions = env.action_space.n       # 4

Record a random policy baseline: average return over 100 episodes (expect large negative returns on Cliff Walking).

Step 2 — Tabular Q-learning

python

class QLearningAgent:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.q = np.zeros((n_states, n_actions))
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.n_actions = n_actions
 
    def act(self, state, explore=True):
        if explore and np.random.random() < self.epsilon:
            return env.action_space.sample()
        return int(np.argmax(self.q[state]))
 
    def learn(self, s, a, r, s_next, done):
        target = r if done else r + self.gamma * np.max(self.q[s_next])
        self.q[s, a] += self.lr * (target - self.q[s, a])

Hyperparameters to keep in a config dict: lr, gamma, epsilon, num_episodes (try 500–2000).

Step 3 — Training loop

python

agent = QLearningAgent(n_states, n_actions)
returns = []
 
for ep in range(2000):
    s, _ = env.reset(seed=ep)
    total_r = 0.0
    done = False
    while not done:
        a = agent.act(s)
        s_next, r, term, trunc, _ = env.step(a)
        done = term or trunc
        agent.learn(s, a, r, s_next, done)
        s = s_next
        total_r += r
    returns.append(total_r)

Decay ε over time (e.g. linear 0.3 → 0.05) for faster convergence.

Step 4 — Visualize greedy policy

For CliffWalking, states are a 4×12 grid (row-major). Map each state to argmax action and draw arrows (↑↓←→).

python

policy = np.argmax(agent.q, axis=1)  # shape (48,)
# map action 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT to unicode arrows for matplotlib

Success criteria

Criterion	Target
Learning curve trends upward	Required
Greedy policy reaches goal without falling off cliff	Required
Final 100-episode mean return > −50 on CliffWalking	Typical
Document one α/ε change and its effect	Required

Extension ideas

Implement SARSA and compare on the same env.
Custom 5×5 windy grid with stochastic transitions.
Heatmap of max_a Q(s,a) per cell.

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.