← Back to curriculum

Module 1 — RL foundations & MDPs

Project: multi-armed bandit

Implement ε-greedy and UCB1, plot cumulative regret, compare exploration strategies.

~90 min read + exercises

Project: Multi-armed bandit

Before we begin

This is the Module 1 capstone. You will implement two classic bandit algorithms from scratch — no Gymnasium environment required. The goal is to feel exploration vs exploitation through cumulative regret, not just read about it.

Figure

One state, many actions

Agentpolicy πEnvironmentaction astate s′, reward r
Bandits strip RL down to the exploration problem before full MDPs.

How this connects to Module 1

LessonWhere you use it
Agents & the RL loopPull arm → observe reward → update estimates
MDPsDegenerate MDP: one state, K actions, stochastic rewards
Returns & discountingPer-step reward; regret = optimal arm return − your return
Bellman / valueQ*(a) = true mean reward of arm a

What you will build

PiecePurpose
BernoulliBandit10 arms with hidden means in (0, 1)
EpsilonGreedyExplore with probability ε, else exploit best estimate
UCB1Optimism under uncertainty — no ε tuning
regret_plot.pngCumulative regret vs step for both algorithms

Folder layout:

text
bandit-lab/
  bandit.py          # environment + algorithms
  run_experiment.py  # main loop + plotting
  outputs/
    regret_plot.png
  README.md          # 3–5 sentences: which algorithm won and why

Estimated time: 2–3 hours.


Before you start

bash
pip install numpy matplotlib

Step 1 — Bernoulli bandit environment

Goal: Simulate K slot machines. Each arm returns 1 with probability μᵢ (unknown to the agent).

python
# bandit.py
import numpy as np
 
class BernoulliBandit:
    def __init__(self, k: int = 10, seed: int = 42):
        rng = np.random.default_rng(seed)
        self.k = k
        self.means = rng.uniform(0.2, 0.8, size=k)  # hidden from agent
        self.optimal = float(self.means.max())
 
    def pull(self, arm: int) -> float:
        return 1.0 if np.random.random() < self.means[arm] else 0.0
 
    def regret(self, arm: int) -> float:
        return self.optimal - self.means[arm]

Sanity check: Print bandit.means once for debugging, then never pass true means to your agents.


Step 2 — ε-greedy and UCB1

Goal: Maintain per-arm reward estimates and a pull count.

python
class EpsilonGreedy:
    def __init__(self, k: int, epsilon: float = 0.1):
        self.k = k
        self.epsilon = epsilon
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
 
    def select(self) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.k)
        return int(np.argmax(self.values))
 
    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
 
 
class UCB1:
    def __init__(self, k: int, c: float = 2.0):
        self.k = k
        self.c = c
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
        self.t = 0
 
    def select(self) -> int:
        self.t += 1
        for arm in range(self.k):
            if self.counts[arm] == 0:
                return arm
        ucb = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
        return int(np.argmax(ucb))
 
    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

Step 3 — Run experiment and plot regret

python
# run_experiment.py
import matplotlib.pyplot as plt
import numpy as np
from bandit import BernoulliBandit, EpsilonGreedy, UCB1
 
STEPS = 10_000
SEED = 42
 
def run(agent_cls, **kwargs):
    bandit = BernoulliBandit(k=10, seed=SEED)
    agent = agent_cls(10, **kwargs)
    regret = []
    total = 0.0
    for _ in range(STEPS):
        arm = agent.select()
        reward = bandit.pull(arm)
        agent.update(arm, reward)
        total += bandit.regret(arm)
        regret.append(total)
    return np.array(regret)
 
reg_eps = run(EpsilonGreedy, epsilon=0.1)
reg_ucb = run(UCB1)
 
plt.plot(reg_eps, label="ε-greedy (ε=0.1)")
plt.plot(reg_ucb, label="UCB1")
plt.xlabel("Step"); plt.ylabel("Cumulative regret")
plt.legend(); plt.title("10-arm Bernoulli bandit")
plt.savefig("outputs/regret_plot.png", dpi=150)

Step 4 — Deliverables & success criteria

CriterionTarget
Both algorithms run 10k steps without errorsRequired
UCB1 final regret lower than ε-greedy (same seed)Expected
README explains one failure mode (e.g. ε too high)Required
One ablation: try ε ∈ 0.3Recommended

Extension ideas

  • Plot instantaneous regret (per-step) with a rolling average.
  • Implement Thompson sampling and add a third curve.
  • Run 20 seeds and plot mean ± std shaded regions.

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.