Deep Reinforcement Learning
A complete, standalone path from MDPs and tabular Q-learning through DQN, policy gradients, PPO, SAC, model-based RL, and production deployment — with quizzes and portfolio-ready projects in every module.
What you'll learn
- MDPs to deep RL
- DQN & PPO
- Robotics & sim-to-real
- Hands-on projects
Lessons in this path
Work top to bottom within each module, or jump in from the table of contents on each lesson page.
Module 1. RL foundations & MDPs
Agents, environments, Markov decision processes, returns, discounting, and Bellman equations — ending with a multi-armed bandit project.
- Lesson 1 (30 min)
Welcome — start here
RL vocabulary, how to read lessons, Module 1 roadmap, and what to install before the bandit project.
- Lesson 2 (55 min)
Agents, environments & the RL loop
State, action, reward, policy; exploration vs exploitation; contrast with supervised learning.
- Lesson 3 (60 min)
Markov decision processes
The MDP tuple, Markov property, transition dynamics, and mapping to Gymnasium spaces.
- Lesson 4 (55 min)
Returns, discounting & episodes
Discounted return, episodic vs continuing tasks, and numeric worked examples.
- Lesson 5 (65 min)
Bellman equations & value functions
V^π, Q^π, Bellman expectation and optimality, and greedy policies from Q*.
- Lesson 6 (45 min)
Module 1 quiz & review
20 interactive MCQs with instant feedback and lesson links for topics you miss.
- Lesson 7 (90 min)
Project: multi-armed bandit
Implement ε-greedy and UCB1, plot cumulative regret, compare exploration strategies.
Module 2. Tabular methods
Dynamic programming, Monte Carlo, temporal-difference learning, Q-learning, SARSA, and on- vs off-policy control — with a gridworld project.
- Lesson 8 (20 min)
Welcome to Module 2
How Module 2 builds on MDPs, lesson order, and prerequisites for the gridworld project.
- Lesson 9 (70 min)
Dynamic programming — policy & value iteration
Policy evaluation, policy iteration, value iteration, and when DP is tractable.
- Lesson 10 (65 min)
Monte Carlo methods
First-visit and every-visit MC, episodic returns, and MC control with ε-soft policies.
- Lesson 11 (65 min)
Temporal-difference learning
TD(0), bootstrapping vs MC, TD error, and n-step returns preview.
- Lesson 12 (70 min)
Q-learning & SARSA
Off-policy Q-learning vs on-policy SARSA, update rules, and convergence intuition.
- Lesson 13 (50 min)
On-policy vs off-policy
Behavior vs target policy, importance sampling preview, and algorithm choice.
- Lesson 14 (50 min)
Module 2 quiz & review
20 MCQs on DP, MC, TD, Q-learning, and SARSA.
- Lesson 15 (120 min)
Project: gridworld Q-learning
Train tabular Q-learning on a gridworld, visualize the greedy policy, tune ε and α.
Module 3. Function approximation
Why tables fail at scale, linear FA, neural nets as value approximators, and the deadly triad — with a CartPole linear-FA lab.
- Lesson 16 (20 min)
Welcome to Module 3
Curse of dimensionality, when to approximate, and setup for the CartPole project.
- Lesson 17 (55 min)
Why tabular methods break
State-space explosion, generalization, and partial observability preview.
- Lesson 18 (65 min)
Linear function approximation
Feature vectors, linear V and Q, semi-gradient TD, and stability caveats.
- Lesson 19 (60 min)
Neural networks as approximators
Nonlinear FA, shared representations, and batching transitions.
- Lesson 20 (60 min)
Instability & the deadly triad
Function approximation + bootstrapping + off-policy — why DQN needs tricks.
- Lesson 21 (45 min)
Module 3 quiz & review
20 MCQs on FA, linear methods, neural approximators, and instability.
- Lesson 22 (110 min)
Project: CartPole with linear FA
Tile coding or polynomial features + semi-gradient TD on CartPole-v1.
Module 4. Deep Q-networks
From tabular Q-learning to DQN: experience replay, target networks, Double/Dueling/PER, debugging — with a CartPole DQN project.
- Lesson 23 (25 min)
Welcome to Module 4
PyTorch setup (GPU optional), and how DQN stabilizes training in the presence of the deadly triad.
- Lesson 24 (65 min)
From Q-learning to DQN
Neural Q-network, loss as MSE on TD target, and ε-greedy exploration.
- Lesson 25 (70 min)
Experience replay & target networks
Replay buffer decorrelation, fixed targets, and soft target updates.
- Lesson 26 (65 min)
Double, dueling & prioritized replay
Overestimation bias, advantage streams, and PER sampling.
- Lesson 27 (60 min)
DQN hyperparameters & debugging
Learning rate, buffer size, target sync, reward scaling, and common failure modes.
- Lesson 28 (45 min)
Module 4 quiz & review
20 MCQs on DQN components and debugging.
- Lesson 29 (150 min)
Project: DQN on CartPole
PyTorch DQN with replay and target net; reach 200+ mean return on CartPole-v1.
Module 5. Policy gradients
REINFORCE, the policy gradient theorem, baselines, and actor–critic — with a REINFORCE CartPole project.
- Lesson 30 (20 min)
Welcome to Module 5
Why learning a policy directly suits continuous action spaces, where the max over actions in Q-learning is hard to compute, and the module roadmap.
- Lesson 31 (55 min)
Why learn policies directly
Stochastic policies, continuous actions, and parameterizing π_θ(a|s).
- Lesson 32 (70 min)
REINFORCE & the policy gradient theorem
Monte Carlo policy gradient, log-derivative trick, and episodic updates.
- Lesson 33 (60 min)
Baseline & variance reduction
State-dependent baselines, advantage intuition, and reward-to-go.
- Lesson 34 (65 min)
Actor–critic architecture
Two networks: policy actor and value critic; TD bootstrapping for critics.
- Lesson 35 (45 min)
Module 5 quiz & review
20 MCQs on policy gradients and actor–critic.
- Lesson 36 (130 min)
Project: REINFORCE on CartPole
Implement REINFORCE with baseline; plot return and policy entropy.
Module 6. Actor–critic & PPO
GAE, TRPO intuition, proximal policy optimization, A2C — with a Lunar Lander PPO project.
- Lesson 37 (20 min)
Welcome to Module 6
Stable policy updates, clip objective preview, Stable-Baselines3 optional setup.
- Lesson 38 (65 min)
Advantage estimation & GAE
TD residuals, λ-returns, and generalized advantage estimation.
- Lesson 39 (60 min)
TRPO intuition
Trust regions, KL constraints, and why naive policy gradients destabilize.
- Lesson 40 (70 min)
Proximal policy optimization
Clipped surrogate objective, multiple epochs per batch, and PPO hyperparameters.
- Lesson 41 (55 min)
A2C & parallel RL
Synchronous workers, vectorized envs, and throughput vs sample efficiency.
- Lesson 42 (45 min)
Module 6 quiz & review
20 MCQs on GAE, TRPO, PPO, and A2C.
- Lesson 43 (140 min)
Project: PPO on Lunar Lander
Train PPO on LunarLander-v2; log clip fraction and track mean return against the solve threshold.
Module 7. Model-based RL
Learned dynamics, Dyna-Q, MCTS, world models — with a Dyna-Q gridworld project.
- Lesson 44 (20 min)
Welcome to Module 7
Sample efficiency vs model error, when planning helps, and project setup.
- Lesson 45 (60 min)
Planning with learned models
Model-based vs model-free trade-offs, rollout planning, and compounding error.
- Lesson 46 (65 min)
Dyna-Q & simulation
Integrate model learning with Q-learning; planning steps per real step.
- Lesson 47 (70 min)
Monte Carlo tree search
Selection, expansion, simulation, backpropagation; UCT; AlphaGo connection.
- Lesson 48 (60 min)
World models & Dreamer (intro)
Learned latent dynamics, imagination rollouts, and Dreamer-style agents.
- Lesson 49 (45 min)
Module 7 quiz & review
20 MCQs on model-based RL and planning.
- Lesson 50 (110 min)
Project: Dyna-Q gridworld
Tabular Dyna-Q; compare sample efficiency with vs without planning steps.
Module 8. Continuous control & robotics
Continuous actions, DDPG, SAC, sim-to-real, robotics case studies — with a SAC Pendulum project.
- Lesson 51 (25 min)
Welcome to Module 8
Continuous action spaces, MuJoCo/Gymnasium continuous envs, and robotics context.
- Lesson 52 (55 min)
Continuous action spaces
Gaussian policies, tanh squashing, action bounds, and reparameterization.
- Lesson 53 (65 min)
DDPG & deterministic policies
Actor–critic for continuous control, target networks, and exploration noise.
- Lesson 54 (70 min)
Soft actor–critic (SAC)
Maximum entropy RL, twin Q critics, automatic temperature tuning.
- Lesson 55 (60 min)
Sim-to-real & domain randomization
Reality gap, randomizing physics/visuals, and system identification preview.
- Lesson 56 (55 min)
Robotics RL case studies
Manipulation benchmarks, sim stacks, and connecting to the robotics track.
- Lesson 57 (45 min)
Module 8 quiz & review
20 MCQs on continuous control and sim-to-real.
- Lesson 58 (130 min)
Project: SAC on Pendulum
Train SAC on Pendulum-v1; compare entropy and learning curves to DDPG.
Module 9. Production & advanced topics
Offline RL, exploration, multi-agent basics, safety, monitoring — with a production RL serving project.
- Lesson 59 (25 min)
Welcome to Module 9
Deploying RL beyond notebooks: constraints, evals, and the capstone API project.
- Lesson 60 (60 min)
Offline RL & batch constraints
CQL/BCQ intuition, distributional shift, and learning from logged data.
- Lesson 61 (55 min)
Exploration & intrinsic motivation
Count-based, curiosity, RND, and sparse-reward environments.
- Lesson 62 (60 min)
Multi-agent RL basics
Independent learners, non-stationarity, and centralized training with decentralized execution (CTDE).
- Lesson 63 (55 min)
Safety, alignment & deployment
Constrained RL, human oversight, reward hacking, and guardrails.
- Lesson 64 (55 min)
Monitoring & evaluation in production
Online vs offline metrics, A/B tests, drift, and rollback strategies.
- Lesson 65 (45 min)
Module 9 quiz & review
20 MCQs on production RL and advanced topics.
- Lesson 66 (180 min)
Project: production RL serving
Serve a trained policy via FastAPI; health checks, batch inference, latency logging.