Project: Multi-armed bandit

Before we begin

This is the Module 1 capstone. You will implement two classic bandit algorithms from scratch — no Gymnasium environment required. The goal is to feel exploration vs exploitation through cumulative regret, not just read about it.

Figure

One state, many actions

Bandits strip RL down to the exploration problem before full MDPs.

How this connects to Module 1

Lesson	Where you use it
Agents & the RL loop	Pull arm → observe reward → update estimates
MDPs	Degenerate MDP: one state, K actions, stochastic rewards
Returns & discounting	Per-step reward; regret = optimal arm return − your return
Bellman / value	Q*(a) = true mean reward of arm a

What you will build

Piece	Purpose
`BernoulliBandit`	10 arms with hidden means in (0, 1)
`EpsilonGreedy`	Explore with probability ε, else exploit best estimate
`UCB1`	Optimism under uncertainty — no ε tuning
`regret_plot.png`	Cumulative regret vs step for both algorithms

Folder layout:

text

bandit-lab/
  bandit.py          # environment + algorithms
  run_experiment.py  # main loop + plotting
  outputs/
    regret_plot.png
  README.md          # 3–5 sentences: which algorithm won and why

Estimated time: 2–3 hours.

Before you start

Finish the Module 1 quiz.
Python 3.10+:

bash

pip install numpy matplotlib

Step 1 — Bernoulli bandit environment

Goal: Simulate K slot machines. Each arm returns 1 with probability μᵢ (unknown to the agent).

python

# bandit.py
import numpy as np
 
class BernoulliBandit:
    def __init__(self, k: int = 10, seed: int = 42):
        rng = np.random.default_rng(seed)
        self.k = k
        self.means = rng.uniform(0.2, 0.8, size=k)  # hidden from agent
        self.optimal = float(self.means.max())
 
    def pull(self, arm: int) -> float:
        return 1.0 if np.random.random() < self.means[arm] else 0.0
 
    def regret(self, arm: int) -> float:
        return self.optimal - self.means[arm]

Sanity check: Print bandit.means once for debugging, then never pass true means to your agents.

Step 2 — ε-greedy and UCB1

Goal: Maintain per-arm reward estimates and a pull count.

python

class EpsilonGreedy:
    def __init__(self, k: int, epsilon: float = 0.1):
        self.k = k
        self.epsilon = epsilon
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
 
    def select(self) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.k)
        return int(np.argmax(self.values))
 
    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
 
 
class UCB1:
    def __init__(self, k: int, c: float = 2.0):
        self.k = k
        self.c = c
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
        self.t = 0
 
    def select(self) -> int:
        self.t += 1
        for arm in range(self.k):
            if self.counts[arm] == 0:
                return arm
        ucb = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
        return int(np.argmax(ucb))
 
    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

Step 3 — Run experiment and plot regret

python

# run_experiment.py
import matplotlib.pyplot as plt
import numpy as np
from bandit import BernoulliBandit, EpsilonGreedy, UCB1
 
STEPS = 10_000
SEED = 42
 
def run(agent_cls, **kwargs):
    bandit = BernoulliBandit(k=10, seed=SEED)
    agent = agent_cls(10, **kwargs)
    regret = []
    total = 0.0
    for _ in range(STEPS):
        arm = agent.select()
        reward = bandit.pull(arm)
        agent.update(arm, reward)
        total += bandit.regret(arm)
        regret.append(total)
    return np.array(regret)
 
reg_eps = run(EpsilonGreedy, epsilon=0.1)
reg_ucb = run(UCB1)
 
plt.plot(reg_eps, label="ε-greedy (ε=0.1)")
plt.plot(reg_ucb, label="UCB1")
plt.xlabel("Step"); plt.ylabel("Cumulative regret")
plt.legend(); plt.title("10-arm Bernoulli bandit")
plt.savefig("outputs/regret_plot.png", dpi=150)

Step 4 — Deliverables & success criteria

Criterion	Target
Both algorithms run 10k steps without errors	Required
UCB1 final regret lower than ε-greedy (same seed)	Expected
README explains one failure mode (e.g. ε too high)	Required
One ablation: try ε ∈ 0.3	Recommended

Extension ideas

Plot instantaneous regret (per-step) with a rolling average.
Implement Thompson sampling and add a third curve.
Run 20 seeds and plot mean ± std shaded regions.

What's next

Return to the course curriculum and continue to the next module when your project runs end-to-end.