Welcome — start here
Before we begin
If you are new to reinforcement learning, you are in the right place. This course is standalone — you do not need the AI track first, though comfort with basic Python and (later) neural networks helps.
This page answers questions beginners actually have:
- What do agent, environment, reward, and policy mean?
- How is RL different from supervised machine learning?
- What will Module 1 teach, and what do I install before the project?
- How should I read each lesson so it sticks?
There is no assumed background in RL or optimal control. Module 1 builds intuition first, then a small multi-armed bandit project so you feel exploration and regret — not just read about them.
Key concepts (plain English)
Agent — The learner or decision-maker: a robot, game player, recommender, or trading strategy. It chooses actions.
Environment — Everything the agent interacts with: physics, users, market, game rules. It returns the next state and a reward.
State — A summary of “where we are now” that is complete enough for the future to depend only on the current state and action, not on the full history (the Markov property). If the state leaves out relevant information, learning gets harder.
Action — What the agent can do: move left, buy, throttle, click. Can be discrete (finite choices) or continuous (a real number or vector).
Reward — A scalar signal saying how good the last step was. Not always “win/lose” — often shaped from many small signals. The agent’s job is to maximize cumulative reward over time.
Policy — The agent’s strategy: a mapping from states to actions, either deterministic or stochastic. For a stochastic policy, π(a|s) is the probability of taking action a in state s.
Value function — How good it is to be in a state (or to take an action in a state), measured as expected future return. Bellman equations connect values across time steps.
Exploration vs exploitation — Try new actions to discover better rewards, or use what already works? Every RL algorithm handles this tension differently.
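The definitions of reward, policy, and value function above can be written compactly. What follows is the standard textbook formulation rather than anything specific to this course. The discount factor γ (assumed to lie in [0, 1)), the transition model P, and the time-indexed symbols S_t and R_t are notational assumptions that this page does not itself introduce.

```latex
\begin{aligned}
G_t &= \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
  && \text{(discounted return from time } t\text{)} \\
V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right]
  && \text{(value of state } s \text{ under policy } \pi\text{)} \\
V^{\pi}(s) &= \sum_{a} \pi(a \mid s) \sum_{s',\, r} P(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{\pi}(s') \,\bigr]
  && \text{(Bellman expectation equation)}
\end{aligned}
```

The last line is what "connects values across time steps": the value of a state equals the expected immediate reward plus the discounted value of wherever the agent lands next.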
| Idea | Supervised ML | Reinforcement learning |
|---|---|---|
| Training signal | Correct label per example | Reward (often sparse, delayed) |
| Data | Fixed dataset | Interaction generates data |
| Goal | Predict labels | Maximize return over behavior |
Figure: The agent–environment loop
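The agent–environment loop can also be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not course code. `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented here, and the reward values and the discount factor `gamma` are arbitrary choices. The toy class only mimics the `reset`/`step` return shapes used by Gymnasium so that the loop looks familiar later.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..4 with the goal at cell 4.

    The reset/step return shapes mimic Gymnasium's convention, but this class
    is a stand-in written for this sketch, not part of any library.
    """

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position, {}  # (state, info)

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        terminated = self.position == self.length - 1  # reached the goal
        truncated = (not terminated) and self.steps >= self.max_steps  # out of time
        reward = 1.0 if terminated else -0.1  # small step cost, bonus at the goal
        return self.position, reward, terminated, truncated, {}


def random_policy(state, rng):
    """A stochastic policy that ignores the state: pi(a|s) = 0.5 for both actions."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, gamma=0.99):
    """Run one episode; return (discounted return, list of (state, action, reward))."""
    state, _ = env.reset()
    trajectory = []
    discounted_return = 0.0
    discount = 1.0
    done = False
    while not done:
        action = policy(state, rng)                       # agent chooses an action
        next_state, reward, terminated, truncated, _ = env.step(action)  # env responds
        trajectory.append((state, action, reward))
        discounted_return += discount * reward
        discount *= gamma
        state = next_state
        done = terminated or truncated
    return discounted_return, trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    g, steps = run_episode(CorridorEnv(), random_policy, rng)
    print(f"episode length: {len(steps)}, discounted return: {g:.3f}")
```

Swapping `random_policy` for a function that always returns 1 reaches the goal in four steps, which is a quick way to see how the policy alone changes the return.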
Figure: Module 1 at a glance
What is this course?
Deep Reinforcement Learning walks from MDPs and tabular Q-learning through DQN, policy gradients, PPO, SAC, model-based RL, and production deployment — with quizzes and projects in every module.
Module 1 in one sentence
You will understand what RL optimizes, how MDPs model problems, and how value functions and Bellman equations underpin every algorithm that follows.
| Lesson | Topic |
|---|---|
| 1 | Agents, environments, the RL loop |
| 2 | Markov decision processes |
| 3 | Returns, discounting, episodes |
| 4 | Bellman equations & value functions |
| Quiz | 20 MCQs with review links |
| Project | Multi-armed bandit (ε-greedy & UCB1) |
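To give a feel for what the project row involves, here is a minimal sketch of ε-greedy and UCB1 on a Bernoulli bandit. It is not the course's project code. The function names, the arm means, the step count, and the choice of Bernoulli rewards are all assumptions made for illustration. Regret is computed from the true arm means, which a real agent never sees, and the UCB1 bonus uses the current step index `t`, one common variant.

```python
import math
import random


def run_bandit(true_means, select_arm, n_steps=2000, seed=0):
    """Play a Bernoulli bandit; return (pull counts, cumulative expected regret)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    best_mean = max(true_means)
    regret = 0.0
    for t in range(1, n_steps + 1):
        arm = select_arm(counts, estimates, t, rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: new = old + (x - old) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        # Expected regret of this pull: gap between the best arm and the chosen arm.
        regret += best_mean - true_means[arm]
    return counts, regret


def make_epsilon_greedy(epsilon=0.1):
    """Build a selector that explores with probability epsilon, else exploits."""
    def select(counts, estimates, t, rng):
        if rng.random() < epsilon:
            return rng.randrange(len(counts))  # explore: uniform random arm
        return max(range(len(counts)), key=lambda a: estimates[a])  # exploit
    return select


def ucb1(counts, estimates, t, rng):
    """Pick the arm with the highest optimistic estimate (mean plus bonus)."""
    # Pull each arm once first so every count is nonzero.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: estimates[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]  # hypothetical arm success probabilities
    for name, selector in [("eps-greedy", make_epsilon_greedy(0.1)), ("UCB1", ucb1)]:
        counts, regret = run_bandit(means, selector)
        print(f"{name}: pulls per arm = {counts}, regret = {regret:.1f}")
```

ε-greedy keeps exploring at a fixed rate forever, while UCB1's bonus shrinks for arms it has already pulled often. Comparing the two regret totals is one way to see that difference.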
Who is this for?
Good fit if you:
- Want to understand how AlphaGo, game-playing agents, or robot policies are trained — not just use them as black boxes.
- Know basic Python and are willing to use Gymnasium for environments.
- Prefer slow, descriptive lessons over bullet-only summaries.
Helpful but not required:
- The AI course Modules 1–4 (gradients, neural nets) before Module 4 (DQN) of this track.
- The Robotics Foundations track for Module 8 (continuous control).
How to read each lesson
- Read Before we begin and What you will learn.
- Answer checkpoint questions before peeking at answers.
- Work through numeric examples with paper or a calculator.
- Use What's next only when the current lesson feels solid.
Progress saves in this browser when you open a lesson.
What to install before the project
The four lessons and the quiz are reading and thinking. Only the project requires code.
- Python 3.10+ — python.org/downloads
- Packages: pip install numpy matplotlib gymnasium
- Any editor (VS Code, Cursor, etc.)
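One way to confirm the setup before starting the project is a short check script. This is a convenience sketch, not an official course tool. `check_setup` is a hypothetical helper name, and it only tests whether each package can be found, not whether a particular version is installed.

```python
import importlib.util
import sys


def check_setup(packages=("numpy", "matplotlib", "gymnasium"), min_python=(3, 10)):
    """Return a dict mapping each requirement to True/False without importing it."""
    report = {"python": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_setup().items():
        print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```

If any line prints MISSING, rerun the pip command in the same Python environment that runs this script.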
From Module 4 onward you will also use PyTorch. From Module 6, Stable-Baselines3 is recommended for PPO/SAC labs.
Full course roadmap
1. RL foundations & MDPs (you are here)
2. Tabular methods (DP, MC, TD, Q-learning)
3. Function approximation
4. Deep Q-networks (DQN)
5. Policy gradients (REINFORCE)
6. Actor–critic & PPO
7. Model-based RL & planning
8. Continuous control & robotics RL
9. Production & advanced topics
Focus on Module 1 for now.
Ready?
Lesson 1 — Agents, environments & the RL loop
Take your time. There is no deadline — only the goal of actually understanding each idea before moving on.