Project: Multi-armed bandit
Before we begin
This is the Module 1 capstone. You will implement two classic bandit algorithms from scratch — no Gymnasium environment required. The goal is to feel exploration vs exploitation through cumulative regret, not just read about it.
Figure
One state, many actions
How this connects to Module 1
| Lesson | Where you use it |
|---|---|
| Agents & the RL loop | Pull arm → observe reward → update estimates |
| MDPs | Degenerate MDP: one state, K actions, stochastic rewards |
| Returns & discounting | Per-step reward; regret = optimal arm return − your return |
| Bellman / value | Q*(a) = true mean reward of arm a |
What you will build
| Piece | Purpose |
|---|---|
BernoulliBandit | 10 arms with hidden means in (0, 1) |
EpsilonGreedy | Explore with probability ε, else exploit best estimate |
UCB1 | Optimism under uncertainty — no ε tuning |
regret_plot.png | Cumulative regret vs step for both algorithms |
Folder layout:
bandit-lab/
bandit.py # environment + algorithms
run_experiment.py # main loop + plotting
outputs/
regret_plot.png
README.md # 3–5 sentences: which algorithm won and whyEstimated time: 2–3 hours.
Before you start
- Finish the Module 1 quiz.
- Python 3.10+:
pip install numpy matplotlibStep 1 — Bernoulli bandit environment
Goal: Simulate K slot machines. Each arm returns 1 with probability μᵢ (unknown to the agent).
# bandit.py
import numpy as np
class BernoulliBandit:
def __init__(self, k: int = 10, seed: int = 42):
rng = np.random.default_rng(seed)
self.k = k
self.means = rng.uniform(0.2, 0.8, size=k) # hidden from agent
self.optimal = float(self.means.max())
def pull(self, arm: int) -> float:
return 1.0 if np.random.random() < self.means[arm] else 0.0
def regret(self, arm: int) -> float:
return self.optimal - self.means[arm]Sanity check: Print bandit.means once for debugging, then never pass true means to your agents.
Step 2 — ε-greedy and UCB1
Goal: Maintain per-arm reward estimates and a pull count.
class EpsilonGreedy:
def __init__(self, k: int, epsilon: float = 0.1):
self.k = k
self.epsilon = epsilon
self.counts = np.zeros(k)
self.values = np.zeros(k)
def select(self) -> int:
if np.random.random() < self.epsilon:
return np.random.randint(self.k)
return int(np.argmax(self.values))
def update(self, arm: int, reward: float) -> None:
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
class UCB1:
def __init__(self, k: int, c: float = 2.0):
self.k = k
self.c = c
self.counts = np.zeros(k)
self.values = np.zeros(k)
self.t = 0
def select(self) -> int:
self.t += 1
for arm in range(self.k):
if self.counts[arm] == 0:
return arm
ucb = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
return int(np.argmax(ucb))
def update(self, arm: int, reward: float) -> None:
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / nStep 3 — Run experiment and plot regret
# run_experiment.py
import matplotlib.pyplot as plt
import numpy as np
from bandit import BernoulliBandit, EpsilonGreedy, UCB1
STEPS = 10_000
SEED = 42
def run(agent_cls, **kwargs):
bandit = BernoulliBandit(k=10, seed=SEED)
agent = agent_cls(10, **kwargs)
regret = []
total = 0.0
for _ in range(STEPS):
arm = agent.select()
reward = bandit.pull(arm)
agent.update(arm, reward)
total += bandit.regret(arm)
regret.append(total)
return np.array(regret)
reg_eps = run(EpsilonGreedy, epsilon=0.1)
reg_ucb = run(UCB1)
plt.plot(reg_eps, label="ε-greedy (ε=0.1)")
plt.plot(reg_ucb, label="UCB1")
plt.xlabel("Step"); plt.ylabel("Cumulative regret")
plt.legend(); plt.title("10-arm Bernoulli bandit")
plt.savefig("outputs/regret_plot.png", dpi=150)Step 4 — Deliverables & success criteria
| Criterion | Target |
|---|---|
| Both algorithms run 10k steps without errors | Required |
| UCB1 final regret lower than ε-greedy (same seed) | Expected |
| README explains one failure mode (e.g. ε too high) | Required |
| One ablation: try ε ∈ 0.3 | Recommended |
Extension ideas
- Plot instantaneous regret (per-step) with a rolling average.
- Implement Thompson sampling and add a third curve.
- Run 20 seeds and plot mean ± std shaded regions.
What's next
Return to the course curriculum and continue to the next module when your project runs end-to-end.