← Back to curriculum

Module 2 — Tabular methods

Project: gridworld Q-learning

Train tabular Q-learning on a gridworld, visualize the greedy policy, tune ε and α.

~120 min read + exercises

Project: Gridworld Q-learning

Before we begin

Train tabular Q-learning on a small gridworld. You will see Bellman backups update a Q-table, then visualize the greedy policy as arrows — the same update rule DQN generalizes with neural nets in Module 4.


How this connects to Module 2

LessonWhere you use it
Dynamic programmingOptimal policy intuition to compare against
Monte CarloFull-episode returns vs bootstrapping
TD learningOne-step backup: Q(s,a) ← Q(s,a) + α(r + γ max Q(s',·) − Q(s,a))
Q-learning & SARSAOff-policy max backup (Q-learning)
On- vs off-policyε-greedy behavior, greedy evaluation

What you will build

PiecePurpose
GridWorld or CliffWalking-v0Small state space you can inspect
Tabular Q-learning2D array Q[state, action]
policy_arrows.pngGreedy action per cell
returns_curve.pngEpisode return vs episode

Folder layout:

text
gridworld-ql/
  env.py             # optional custom 5×5 grid
  q_learning.py      # agent + training loop
  visualize.py       # policy arrows + learning curve
  outputs/

Estimated time: 3–4 hours.


Before you start

  • Finish the Module 2 quiz.
  • pip install gymnasium numpy matplotlib

Option A (recommended): Gymnasium CliffWalking-v0 — 48 states, 4 actions, classic textbook env.

Option B: Custom 5×5 grid with start, goal, and walls (implement reset/step returning discrete state ids 0…N-1).


Step 1 — Environment setup

python
import gymnasium as gym
import numpy as np
 
env = gym.make("CliffWalking-v0")
n_states = env.observation_space.n   # 48
n_actions = env.action_space.n       # 4

Record a random policy baseline: average return over 100 episodes (expect large negative returns on Cliff Walking).


Step 2 — Tabular Q-learning

python
class QLearningAgent:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.q = np.zeros((n_states, n_actions))
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.n_actions = n_actions
 
    def act(self, state, explore=True):
        if explore and np.random.random() < self.epsilon:
            return env.action_space.sample()
        return int(np.argmax(self.q[state]))
 
    def learn(self, s, a, r, s_next, done):
        target = r if done else r + self.gamma * np.max(self.q[s_next])
        self.q[s, a] += self.lr * (target - self.q[s, a])

Hyperparameters to keep in a config dict: lr, gamma, epsilon, num_episodes (try 500–2000).


Step 3 — Training loop

python
agent = QLearningAgent(n_states, n_actions)
returns = []
 
for ep in range(2000):
    s, _ = env.reset(seed=ep)
    total_r = 0.0
    done = False
    while not done:
        a = agent.act(s)
        s_next, r, term, trunc, _ = env.step(a)
        done = term or trunc
        agent.learn(s, a, r, s_next, done)
        s = s_next
        total_r += r
    returns.append(total_r)

Decay ε over time (e.g. linear 0.3 → 0.05) for faster convergence.


Step 4 — Visualize greedy policy

For CliffWalking, states are a 4×12 grid (row-major). Map each state to argmax action and draw arrows (↑↓←→).

python
policy = np.argmax(agent.q, axis=1)  # shape (48,)
# map action 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT to unicode arrows for matplotlib

Success criteria

CriterionTarget
Learning curve trends upwardRequired
Greedy policy reaches goal without falling off cliffRequired
Final 100-episode mean return > −50 on CliffWalkingTypical
Document one α/ε change and its effectRequired

Extension ideas

  • Implement SARSA and compare on the same env.
  • Custom 5×5 windy grid with stochastic transitions.
  • Heatmap of max_a Q(s,a) per cell.

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.