Module 1. RL foundations & MDPs

Agents, environments, Markov decision processes, returns, discounting, and Bellman equations — ending with a multi-armed bandit project.

Lesson 1 (30 min)
Welcome — start here
RL vocabulary, how to read lessons, Module 1 roadmap, and what to install before the bandit project.
Lesson 2 (55 min)
Agents, environments & the RL loop
State, action, reward, policy; exploration vs exploitation; contrast with supervised learning.
Lesson 3 (60 min)
Markov decision processes
The MDP tuple, Markov property, transition dynamics, and mapping to Gymnasium spaces.
Lesson 4 (55 min)
Returns, discounting & episodes
Discounted return, episodic vs continuing tasks, and numeric worked examples.
Lesson 5 (65 min)
Bellman equations & value functions
V^π, Q^π, Bellman expectation and optimality, and greedy policies from Q*.
Lesson 6 (45 min)
Module 1 quiz & review
20 interactive MCQs with instant feedback and lesson links for topics you miss.
Lesson 7 (90 min)
Project: multi-armed bandit
Implement ε-greedy and UCB1, plot cumulative regret, compare exploration strategies.
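
As a preview of the bandit project, here is a minimal sketch of ε-greedy and UCB1 on a Bernoulli bandit, tracking cumulative expected regret. It is an illustration under stated assumptions, not the project's reference solution: the function names (`run_bandit`, `epsilon_greedy`, `ucb1`), the Bernoulli reward model, and the arm means are all hypothetical choices made for this example.

```python
import math
import random


def run_bandit(select_arm, true_means, steps, seed=0):
    """Play a Bernoulli bandit; return cumulative expected regret.

    true_means is an assumed list of per-arm success probabilities.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms    # number of pulls per arm
    values = [0.0] * n_arms  # running mean reward per arm
    best_mean = max(true_means)
    regret = 0.0
    for t in range(1, steps + 1):
        arm = select_arm(counts, values, t, rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        # Expected regret: gap between the best arm and the chosen arm.
        regret += best_mean - true_means[arm]
    return regret


def epsilon_greedy(epsilon):
    """Return a selector that explores uniformly with probability epsilon."""
    def select(counts, values, t, rng):
        if rng.random() < epsilon:
            return rng.randrange(len(counts))
        return max(range(len(counts)), key=lambda a: values[a])
    return select


def ucb1(counts, values, t, rng):
    """UCB1: pull each arm once, then maximize mean plus sqrt(2 ln t / n)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: values[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]  # assumed arm probabilities, for illustration only
    print("epsilon-greedy regret:", run_bandit(epsilon_greedy(0.1), means, 2000))
    print("UCB1 regret:", run_bandit(ucb1, means, 2000))
```

Recording `regret` after every step instead of only at the end would give the curve needed for a cumulative regret plot.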

Module 2. Tabular methods

Dynamic programming, Monte Carlo, temporal-difference learning, Q-learning, SARSA, and on- vs off-policy control — with a gridworld project.

Module 3. Function approximation

Why tables fail at scale, linear FA, neural nets as value approximators, and the deadly triad — with a CartPole linear-FA lab.

Module 4. Deep Q-networks

From tabular Q-learning to DQN: experience replay, target networks, Double DQN, dueling networks, prioritized experience replay (PER), and debugging, with a CartPole DQN project.

Module 5. Policy gradients

REINFORCE, the policy gradient theorem, baselines, and actor–critic — with a REINFORCE CartPole project.

Module 6. Actor–critic & PPO

GAE, TRPO intuition, proximal policy optimization, A2C — with a Lunar Lander PPO project.

Module 7. Model-based RL

Learned dynamics, Dyna-Q, MCTS, world models — with a Dyna-Q gridworld project.

Module 8. Continuous control & robotics

Continuous actions, DDPG, SAC, sim-to-real, robotics case studies — with a SAC Pendulum project.

Module 9. Production & advanced topics

Offline RL, exploration, multi-agent basics, safety, monitoring — with a production RL serving project.

Lessons in this path

Module 1. RL foundations & MDPs

Welcome — start here

Agents, environments & the RL loop

Markov decision processes

Returns, discounting & episodes

Bellman equations & value functions

Module 1 quiz & review

Project: multi-armed bandit

Module 2. Tabular methods

Welcome to Module 2

Dynamic programming — policy & value iteration

Monte Carlo methods

Temporal-difference learning

Q-learning & SARSA

On-policy vs off-policy

Module 2 quiz & review

Project: gridworld Q-learning

Module 3. Function approximation

Welcome to Module 3

Why tabular methods break

Linear function approximation

Neural networks as approximators

Instability & the deadly triad

Module 3 quiz & review

Project: CartPole with linear FA

Module 4. Deep Q-networks

Welcome to Module 4

From Q-learning to DQN

Experience replay & target networks

Double, dueling & prioritized replay

DQN hyperparameters & debugging

Module 4 quiz & review

Project: DQN on CartPole

Module 5. Policy gradients

Welcome to Module 5

Why learn policies directly

REINFORCE & the policy gradient theorem

Baselines & variance reduction

Actor–critic architecture

Module 5 quiz & review

Project: REINFORCE on CartPole

Module 6. Actor–critic & PPO

Welcome to Module 6

Advantage estimation & GAE

TRPO intuition

Proximal policy optimization

A2C & parallel RL

Module 6 quiz & review

Project: PPO on Lunar Lander

Module 7. Model-based RL

Welcome to Module 7

Planning with learned models

Dyna-Q & simulation

Monte Carlo tree search

World models & Dreamer (intro)

Module 7 quiz & review

Project: Dyna-Q gridworld

Module 8. Continuous control & robotics

Welcome to Module 8

Continuous action spaces

DDPG & deterministic policies

Soft actor–critic (SAC)

Sim-to-real & domain randomization

Robotics RL case studies

Module 8 quiz & review

Project: SAC on Pendulum

Module 9. Production & advanced topics

Welcome to Module 9

Offline RL & batch constraints

Exploration & intrinsic motivation

Multi-agent RL basics

Safety, alignment & deployment

Monitoring & evaluation in production

Module 9 quiz & review

Project: production RL serving