Learning path · Intermediate · ~118 hours

Deep Reinforcement Learning

A complete, standalone path from MDPs and tabular Q-learning through DQN, policy gradients, PPO, SAC, model-based RL, and production deployment — with quizzes and portfolio-ready projects in every module.

What you'll learn

  • MDPs to deep RL
  • DQN & PPO
  • Robotics & sim-to-real
  • Hands-on projects

Lessons in this path

Work top to bottom within each module, or jump in from the table of contents on each lesson page.

Module 1. RL foundations & MDPs

Agents, environments, Markov decision processes, returns, discounting, and Bellman equations — ending with a multi-armed bandit project.

  1. Lesson 1 · 30 min

    Welcome — start here

    RL vocabulary, how to read lessons, Module 1 roadmap, and what to install before the bandit project.

  2. Lesson 2 · 55 min

    Agents, environments & the RL loop

    State, action, reward, policy; exploration vs exploitation; contrast with supervised learning.

  3. Lesson 3 · 60 min

    Markov decision processes

    The MDP tuple, Markov property, transition dynamics, and mapping to Gymnasium spaces.

  4. Lesson 4 · 55 min

    Returns, discounting & episodes

    Discounted return, episodic vs continuing tasks, and numeric worked examples.

  5. Lesson 5 · 65 min

    Bellman equations & value functions

    V^π, Q^π, Bellman expectation and optimality, and greedy policies from Q*.

  6. Lesson 6 · 45 min

    Module 1 quiz & review

    20 interactive MCQs with instant feedback and lesson links for topics you miss.

  7. Lesson 7 · 90 min

    Project: multi-armed bandit

    Implement ε-greedy and UCB1, plot cumulative regret, compare exploration strategies.
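
As a taste of the numeric worked examples in Lesson 4, the discounted return can be computed in a few lines. This is a minimal sketch, not course material: the function name `discounted_return` and the sample rewards are illustrative assumptions, and `rewards[t]` is taken to be the reward received after step t of one finite episode.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum over t of gamma**t * rewards[t] for one finite episode."""
    g = 0.0
    # Work backwards so each step applies the recursion G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical three-step episode: 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

The backward loop is the same recursion that the Bellman equations in Lesson 5 express in expectation.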

Module 2. Tabular methods

Dynamic programming, Monte Carlo, temporal-difference learning, Q-learning, SARSA, and on- vs off-policy control — with a gridworld project.

  1. Lesson 8 · 20 min

    Welcome to Module 2

    How Module 2 builds on MDPs, lesson order, and prerequisites for the gridworld project.

  2. Lesson 9 · 70 min

    Dynamic programming — policy & value iteration

    Policy evaluation, policy iteration, value iteration, and when DP is tractable.

  3. Lesson 10 · 65 min

    Monte Carlo methods

    First-visit and every-visit MC, episodic returns, and MC control with ε-soft policies.

  4. Lesson 11 · 65 min

    Temporal-difference learning

    TD(0), bootstrapping vs MC, TD error, and n-step returns preview.

  5. Lesson 12 · 70 min

    Q-learning & SARSA

    Off-policy Q-learning vs on-policy SARSA, update rules, and convergence intuition.

  6. Lesson 13 · 50 min

    On-policy vs off-policy

    Behavior vs target policy, importance sampling preview, and algorithm choice.

  7. Lesson 14 · 50 min

    Module 2 quiz & review

    20 MCQs on DP, MC, TD, Q-learning, and SARSA.

  8. Lesson 15 · 120 min

    Project: gridworld Q-learning

    Train tabular Q-learning on a gridworld, visualize the greedy policy, tune ε and α.
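
The difference between the off-policy and on-policy updates in Lesson 12 comes down to one term in the TD target. The sketch below illustrates this under assumptions that are not from the course: the Q-table is a `defaultdict` keyed by `(state, action)`, and the state and action names are hypothetical.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, done):
    """Off-policy target: bootstrap from the greedy action in s_next."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done):
    """On-policy target: bootstrap from the action the agent actually takes next."""
    next_q = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_q - Q[(s, a)])


# Hypothetical two-state example with identical starting tables.
Q_ql, Q_sarsa = defaultdict(float), defaultdict(float)
for Q in (Q_ql, Q_sarsa):
    Q[("s1", "left")], Q[("s1", "right")] = 1.0, 3.0

q_learning_update(Q_ql, "s0", "right", 0.0, "s1", ["left", "right"],
                  alpha=0.5, gamma=0.9, done=False)
sarsa_update(Q_sarsa, "s0", "right", 0.0, "s1", "left",
             alpha=0.5, gamma=0.9, done=False)
print(Q_ql[("s0", "right")], Q_sarsa[("s0", "right")])
```

Q-learning bootstraps from the best next action (value 3.0) even though the behavior policy chose "left" (value 1.0), which is why the two tables diverge after a single step.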

Module 3. Function approximation

Why tables fail at scale, linear FA, neural nets as value approximators, and the deadly triad — with a CartPole linear-FA lab.

  1. Lesson 16 · 20 min

    Welcome to Module 3

    Curse of dimensionality, when to approximate, and setup for the CartPole project.

  2. Lesson 17 · 55 min

    Why tabular methods break

    State-space explosion, generalization, and partial observability preview.

  3. Lesson 18 · 65 min

    Linear function approximation

    Feature vectors, linear V and Q, semi-gradient TD, and stability caveats.

  4. Lesson 19 · 60 min

    Neural networks as approximators

    Nonlinear FA, shared representations, and batching transitions.

  5. Lesson 20 · 60 min

    Instability & the deadly triad

    Function approximation + bootstrapping + off-policy — why DQN needs tricks.

  6. Lesson 21 · 45 min

    Module 3 quiz & review

    20 MCQs on FA, linear methods, neural approximators, and instability.

  7. Lesson 22 · 110 min

    Project: CartPole with linear FA

    Tile coding or polynomial features + semi-gradient TD on CartPole-v1.

Module 4. Deep Q-networks

From tabular Q-learning to DQN: experience replay, target networks, Double/Dueling/PER, debugging — with a CartPole DQN project.

  1. Lesson 23 · 25 min

    Welcome to Module 4

    PyTorch setup (GPU optional), and how DQN tames the instability caused by the deadly triad.

  2. Lesson 24 · 65 min

    From Q-learning to DQN

    Neural Q-network, loss as MSE on TD target, and ε-greedy exploration.

  3. Lesson 25 · 70 min

    Experience replay & target networks

    Replay buffer decorrelation, fixed targets, and soft target updates.

  4. Lesson 26 · 65 min

    Double, dueling & prioritized replay

    Overestimation bias, advantage streams, and PER sampling.

  5. Lesson 27 · 60 min

    DQN hyperparameters & debugging

    Learning rate, buffer size, target sync, reward scaling, and common failure modes.

  6. Lesson 28 · 45 min

    Module 4 quiz & review

    20 MCQs on DQN components and debugging.

  7. Lesson 29 · 150 min

    Project: DQN on CartPole

    PyTorch DQN with replay and target net; reach 200+ mean return on CartPole-v1.
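
The TD targets and soft target updates named in Lessons 24 to 26 can be written without any deep learning framework. The following is a framework-free sketch under stated assumptions: Q-values are plain Python lists indexed by action, network parameters are flat lists of floats, and all function names are illustrative rather than taken from the course or from PyTorch.

```python
def dqn_target(r, q_target_next, gamma, done):
    """Standard DQN: bootstrap from the max of the target network's next-state values."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma, done):
    """Double DQN: the online net picks the action, the target net scores it."""
    if done:
        return r
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[a_star]


def soft_update(target_params, online_params, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]


# Hypothetical next-state values where the two networks disagree on the best action.
q_online_next, q_target_next = [1.0, 2.0], [5.0, 3.0]
print(dqn_target(1.0, q_target_next, gamma=0.5, done=False))                        # 3.5
print(double_dqn_target(1.0, q_online_next, q_target_next, gamma=0.5, done=False))  # 2.5
print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5))                                 # [0.5, 1.0]
```

In this example the Double DQN target is lower because the action is selected by one network and evaluated by the other, which is the mechanism used to reduce overestimation bias.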

Module 5. Policy gradients

REINFORCE, the policy gradient theorem, baselines, and actor–critic — with a REINFORCE CartPole project.

  1. Lesson 30 · 20 min

    Welcome to Module 5

    Why learning a policy directly often suits continuous action spaces better than Q-learning, and the module roadmap.

  2. Lesson 31 · 55 min

    Why learn policies directly

    Stochastic policies, continuous actions, and parameterizing π_θ(a|s).

  3. Lesson 32 · 70 min

    REINFORCE & the policy gradient theorem

    Monte Carlo policy gradient, log-derivative trick, and episodic updates.

  4. Lesson 33 · 60 min

    Baseline & variance reduction

    State-dependent baselines, advantage intuition, and reward-to-go.

  5. Lesson 34 · 65 min

    Actor–critic architecture

    Two networks: policy actor and value critic; TD bootstrapping for critics.

  6. Lesson 35 · 45 min

    Module 5 quiz & review

    20 MCQs on policy gradients and actor–critic.

  7. Lesson 36 · 130 min

    Project: REINFORCE on CartPole

    Implement REINFORCE with baseline; plot return and policy entropy.

Module 6. Actor–critic & PPO

GAE, TRPO intuition, proximal policy optimization, A2C — with a Lunar Lander PPO project.

  1. Lesson 37 · 20 min

    Welcome to Module 6

    Stable policy updates, clip objective preview, Stable-Baselines3 optional setup.

  2. Lesson 38 · 65 min

    Advantage estimation & GAE

    TD residuals, λ-returns, and generalized advantage estimation.

  3. Lesson 39 · 60 min

    TRPO intuition

    Trust regions, KL constraints, and why naive policy gradients destabilize.

  4. Lesson 40 · 70 min

    Proximal policy optimization

    Clipped surrogate objective, multiple epochs per batch, and PPO hyperparameters.

  5. Lesson 41 · 55 min

    A2C & parallel RL

    Synchronous workers, vectorized envs, and throughput vs sample efficiency.

  6. Lesson 42 · 45 min

    Module 6 quiz & review

    20 MCQs on GAE, TRPO, PPO, and A2C.

  7. Lesson 43 · 140 min

    Project: PPO on Lunar Lander

    Train PPO on LunarLander-v2 (registered as LunarLander-v3 in Gymnasium 1.0 and later); log the clip fraction and track progress toward the solve threshold.
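
Generalized advantage estimation (Lesson 38) and the clipped surrogate objective (Lesson 40) are both short recursions or formulas at their core. The sketch below is pure Python and rests on assumptions that are not from the course: the rollout segment contains no episode boundary, `last_value` is the critic's estimate for the state after the final step (0.0 if that state is terminal), and the function names are illustrative.

```python
def gae(rewards, values, last_value, gamma, lam):
    """Advantages A_t = sum over k of (gamma * lam)**k * delta_{t+k} for one segment."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_term(ratio, advantage, eps):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


# With zero value estimates and gamma = lam = 1, advantages are plain rewards-to-go.
print(gae([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0))  # [2.0, 1.0]
# A ratio of 1.5 with a positive advantage is clipped at 1 + eps.
print(ppo_clip_term(1.5, 1.0, eps=0.2))
```

Setting `lam=0` reduces the estimator to the one-step TD residual, while `lam=1` recovers the Monte Carlo return minus the value baseline.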

Module 7. Model-based RL

Learned dynamics, Dyna-Q, MCTS, world models — with a Dyna-Q gridworld project.

  1. Lesson 44 · 20 min

    Welcome to Module 7

    Sample efficiency vs model error, when planning helps, and project setup.

  2. Lesson 45 · 60 min

    Planning with learned models

    Model-based vs model-free trade-offs, rollout planning, and compounding error.

  3. Lesson 46 · 65 min

    Dyna-Q & simulation

    Integrate model learning with Q-learning; planning steps per real step.

  4. Lesson 47 · 70 min

    Monte Carlo tree search

    Selection, expansion, simulation, and backpropagation of results up the tree; UCT; the AlphaGo connection.

  5. Lesson 48 · 60 min

    World models & Dreamer (intro)

    Learned latent dynamics, imagination rollouts, and Dreamer-style agents.

  6. Lesson 49 · 45 min

    Module 7 quiz & review

    20 MCQs on model-based RL and planning.

  7. Lesson 50 · 110 min

    Project: Dyna-Q gridworld

    Tabular Dyna-Q; compare sample efficiency with vs without planning steps.

Module 8. Continuous control & robotics

Continuous actions, DDPG, SAC, sim-to-real, robotics case studies — with a SAC Pendulum project.

  1. Lesson 51 · 25 min

    Welcome to Module 8

    Continuous action spaces, MuJoCo/Gymnasium continuous envs, and robotics context.

  2. Lesson 52 · 55 min

    Continuous action spaces

    Gaussian policies, tanh squashing, action bounds, and reparameterization.

  3. Lesson 53 · 65 min

    DDPG & deterministic policies

    Actor–critic for continuous control, target networks, and exploration noise.

  4. Lesson 54 · 70 min

    Soft actor–critic (SAC)

    Maximum entropy RL, twin Q critics, automatic temperature tuning.

  5. Lesson 55 · 60 min

    Sim-to-real & domain randomization

    Reality gap, randomizing physics/visuals, and system identification preview.

  6. Lesson 56 · 55 min

    Robotics RL case studies

    Manipulation benchmarks, sim stacks, and connecting to the robotics track.

  7. Lesson 57 · 45 min

    Module 8 quiz & review

    20 MCQs on continuous control and sim-to-real.

  8. Lesson 58 · 130 min

    Project: SAC on Pendulum

    Train SAC on Pendulum-v1; compare entropy and learning curves to DDPG.
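
The Gaussian policy, tanh squashing, action bounds, and reparameterization listed under Lesson 52 fit together in a single sampling step. This is a one-dimensional sketch using only the standard library, under assumptions that are not from the course: the function name, the `rng` argument, and the 1e-6 stabilizer are illustrative choices, and the bounds of -2 and 2 match the torque limits of Pendulum-v1.

```python
import math
import random


def sample_squashed_action(mu, log_std, low, high, rng=random):
    """Reparameterized sample u = mu + std * eps, squashed by tanh, rescaled to [low, high]."""
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)          # noise is independent of the policy parameters
    u = mu + std * eps                 # pre-squash Gaussian sample
    y = math.tanh(u)                   # squashed into (-1, 1)
    action = low + 0.5 * (y + 1.0) * (high - low)
    # Gaussian log-density of u, then the change-of-variables correction for tanh.
    # This is the log-density of y; rescaling to [low, high] only adds a constant.
    log_prob = -0.5 * eps * eps - log_std - 0.5 * math.log(2.0 * math.pi)
    log_prob -= math.log(1.0 - y * y + 1e-6)
    return action, log_prob


rng = random.Random(0)
samples = [sample_squashed_action(0.0, 0.0, -2.0, 2.0, rng) for _ in range(5)]
print(all(-2.0 <= a <= 2.0 for a, _ in samples))  # True
```

Because the noise is drawn separately from `mu` and `log_std`, an autodiff framework could differentiate the action with respect to the policy parameters, which is what SAC relies on.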

Module 9. Production & advanced topics

Offline RL, exploration, multi-agent basics, safety, monitoring — with a production RL serving project.

  1. Lesson 59 · 25 min

    Welcome to Module 9

    Deploying RL beyond notebooks: constraints, evals, and the capstone API project.

  2. Lesson 60 · 60 min

    Offline RL & batch constraints

    CQL/BCQ intuition, distributional shift, and learning from logged data.

  3. Lesson 61 · 55 min

    Exploration & intrinsic motivation

    Count-based, curiosity, RND, and sparse-reward environments.

  4. Lesson 62 · 60 min

    Multi-agent RL basics

    Independent learners, non-stationarity, and centralized training with decentralized execution (CTDE).

  5. Lesson 63 · 55 min

    Safety, alignment & deployment

    Constrained RL, human oversight, reward hacking, and guardrails.

  6. Lesson 64 · 55 min

    Monitoring & evaluation in production

    Online vs offline metrics, A/B tests, drift, and rollback strategies.

  7. Lesson 65 · 45 min

    Module 9 quiz & review

    20 MCQs on production RL and advanced topics.

  8. Lesson 66 · 180 min

    Project: production RL serving

    Serve a trained policy via FastAPI; health checks, batch inference, latency logging.