Welcome — start here

Before we begin

If you are new to reinforcement learning, you are in the right place. This course is standalone — you do not need the AI track first, though comfort with basic Python and (later) neural networks helps.

This page answers questions beginners actually have:

What do agent, environment, reward, and policy mean?
How is RL different from supervised machine learning?
What will Module 1 teach, and what do I install before the project?
How should I read each lesson so it sticks?

No background in RL or optimal control is assumed. Module 1 builds intuition first, then ends with a small multi-armed bandit project so that you experience exploration and regret rather than just reading about them.

Key concepts (plain English)

Agent — The learner or decision-maker: a robot, game player, recommender, or trading strategy. It chooses actions.

Environment — Everything the agent interacts with: physics, users, market, game rules. It returns next state and reward.

State — A summary of “where we are now” that holds enough information that, once you know it, the future does not depend on the rest of the history (the Markov property). If the state leaves out information that affects what happens next, learning gets harder.
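
As a rough sketch, the Markov property can be written in symbols as below. The notation is an assumption made here for illustration (S_t and A_t for the state and action at step t, s' for a possible next state); the lessons may use different symbols.

```latex
\[
\Pr\bigl(S_{t+1} = s' \mid S_t, A_t\bigr)
  \;=\;
\Pr\bigl(S_{t+1} = s' \mid S_0, A_0, \dots, S_t, A_t\bigr)
\]
```

Read it as: conditioning on the whole history gives the same prediction as conditioning on the current state and action alone.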

Action — What the agent can do: move left, buy, throttle, click. Can be discrete (finite choices) or continuous (a real number or vector).

Reward — A scalar signal saying how good the last step was. Not always “win/lose” — often shaped from many small signals. The agent’s job is to maximize cumulative reward over time.
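
To make “cumulative reward” concrete, here is a minimal sketch that adds up the rewards from one episode. The `discount` parameter is an assumption added for illustration (discounting is the subject of Lesson 3); with the default of 1.0 the function is a plain sum. The reward values are made up.

```python
def cumulative_reward(rewards, discount=1.0):
    """Sum rewards, weighting the reward at step t by discount**t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (discount ** t) * r
    return total


# Made-up rewards from a single five-step episode.
episode_rewards = [0.0, 0.0, 1.0, -0.5, 2.0]

print(cumulative_reward(episode_rewards))       # 2.5 (plain sum)
print(cumulative_reward(episode_rewards, 0.5))  # 0.3125 (later rewards count less)
```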

Policy — The agent’s strategy: mapping states to actions (deterministic or stochastic). Written π(a|s) = probability of action a in state s.
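
One simple way to picture π(a|s) is as a lookup table of action probabilities per state. The sketch below uses a hypothetical two-state robot example (the state and action names are invented for illustration) and samples an action from the table.

```python
import random

# Hypothetical example: policy[s][a] is the probability of action a in state s.
policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},    # stochastic
    "full_battery": {"recharge": 0.0, "explore": 1.0},   # effectively deterministic
}


def sample_action(policy, state, rng=random):
    """Draw one action according to the probabilities stored for this state."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same draws
print(sample_action(policy, "low_battery", rng))
print(sample_action(policy, "full_battery", rng))
```

A deterministic policy is the special case where one action in each state has probability 1.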

Value function — How good it is to be in a state (or to take an action in a state), measured as expected future return. Bellman equations connect values across time steps.
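
For readers who like to see the symbols early, one common way to write the state-value function and its Bellman equation is sketched below. The notation is an assumption made here (G_t for the return from step t, γ for the discount factor, p(s', r | s, a) for the environment dynamics); Lesson 4 develops these ideas properly and may use a different form.

```latex
\[
v_\pi(s) \;=\; \mathbb{E}_\pi\bigl[\, G_t \mid S_t = s \,\bigr]
\]
\[
v_\pi(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,
  \bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
\]
```

The second line says that the value of a state equals the expected immediate reward plus the discounted value of wherever the agent lands next.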

Exploration vs exploitation — Try new actions to discover better rewards, or use what already works? Every RL algorithm handles this tension differently.
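
A minimal sketch of one way to handle this tension is ε-greedy action selection, which the Module 1 project also uses. The function name and the example value estimates below are illustrative assumptions, not the project's reference code.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-looking one."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(value_estimates))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])


estimates = [0.2, 0.5, 0.1]  # made-up estimates for three actions
rng = random.Random(0)
picks = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]

# Fraction of picks that chose action 1, the one with the highest estimate.
print(picks.count(1) / len(picks))
```

Setting `epsilon=0` gives pure exploitation and `epsilon=1` gives pure random exploration; values in between trade one off against the other.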

Idea	Supervised ML	Reinforcement learning
Training signal	Correct label per example	Reward (often sparse, delayed)
Data	Fixed dataset	Interaction generates data
Goal	Predict labels	Maximize return over behavior

Figure: The agent–environment loop. Observe state → choose action → receive reward and next state → repeat.
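
The loop above can be sketched in a few lines of plain Python. `CorridorEnv` is a hypothetical toy environment invented for this illustration; it is not part of Gymnasium and its `reset`/`step` methods only loosely imitate that style.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk along a corridor from cell 0 to the goal."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # the state is the agent's position

    def step(self, action):
        move = 1 if action == "right" else -1
        # Walls at both ends keep the position inside [0, goal].
        self.position = min(max(self.position + move, 0), self.goal)
        self.steps += 1
        reached_goal = self.position == self.goal
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, policy):
    """Observe state, choose action, receive reward and next state, repeat."""
    state = env.reset()
    total_reward, n_steps, done = 0.0, 0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        n_steps += 1
    return total_reward, n_steps


env = CorridorEnv()
print(run_episode(env, policy=lambda state: "right"))  # (1.0, 3)
```

Here the policy is just a function from state to action; a learning agent would also update that function from the rewards it sees.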

Figure: Module 1 at a glance. Welcome, four core lessons, quiz, then hands-on bandit project.

What is this course?

Deep Reinforcement Learning walks from MDPs and tabular Q-learning through DQN, policy gradients, PPO, SAC, model-based RL, and production deployment — with quizzes and projects in every module.

Module 1 in one sentence

You will understand what RL optimizes, how MDPs model problems, and how value functions and Bellman equations underpin every algorithm that follows.

Lesson	Topic
1	Agents, environments, the RL loop
2	Markov decision processes
3	Returns, discounting, episodes
4	Bellman equations & value functions
Quiz	20 MCQs with review links
Project	Multi-armed bandit (ε-greedy & UCB1)
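
As a preview of the project row above, here is a minimal sketch of UCB1 on a Bernoulli bandit. It is an illustration under stated assumptions, not the project's reference solution: the function name, the arm means, and the way regret is tallied (best arm's mean minus the pulled arm's mean, summed over steps) are all choices made here.

```python
import math
import random


def ucb1_bandit(arm_means, steps, seed=0):
    """Run UCB1 on a Bernoulli bandit; return pull counts and expected regret."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # how often each arm was pulled
    sums = [0.0] * n_arms   # total reward collected from each arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once before using the formula
        else:
            # UCB1 index: average reward plus an exploration bonus that
            # shrinks as an arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - arm_means[arm]

    return counts, regret


counts, regret = ucb1_bandit([0.2, 0.5, 0.8], steps=2000)
print("pulls per arm:", counts)
print("expected regret:", round(regret, 1))
```

The true arm means are used only to simulate rewards and to score regret; the algorithm itself sees nothing but the rewards it receives.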

Who is this for?

Good fit if you:

Want to understand how AlphaGo, game-playing agents, or robot policies are trained — not just use them as black boxes.
Know basic Python and are willing to use Gymnasium for environments.
Prefer slow, descriptive lessons over bullet-only summaries.

Helpful but not required:

Modules 1–4 of the AI course (gradients, neural nets), ideally completed before you reach Module 4 (DQN) of this track.
The Robotics Foundations track, which gives useful background for Module 8 (continuous control).

How to read each lesson

Read “Before we begin” and “What you will learn”.
Answer checkpoint questions before peeking at answers.
Work through numeric examples with paper or a calculator.
Use “What's next” only when the current lesson feels solid.

Your progress is saved in this browser each time you open a lesson.

What to install before the project

The four lessons and the quiz are reading and thinking. Only the project requires code.

Python 3.10+ — python.org/downloads
pip install numpy matplotlib gymnasium
Any editor (VS Code, Cursor, etc.)
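
Once everything is installed, a short script like the sketch below can confirm the setup. It uses only the Python standard library, so it runs even if a package is missing; the helper name `missing_packages` is made up for this example.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["numpy", "matplotlib", "gymnasium"]

print("Python 3.10 or newer:", sys.version_info >= (3, 10))
print("Missing packages:", missing_packages(required) or "none")
```

If a package shows up as missing, rerun the pip command above with the same Python interpreter you used to run this script.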

From Module 4 onward you will also use PyTorch. From Module 6, Stable-Baselines3 is recommended for PPO/SAC labs.

Full course roadmap

1. RL foundations & MDPs (you are here)
2. Tabular methods (DP, MC, TD, Q-learning)
3. Function approximation
4. Deep Q-networks (DQN)
5. Policy gradients (REINFORCE)
6. Actor–critic & PPO
7. Model-based RL & planning
8. Continuous control & robotics RL
9. Production & advanced topics

Focus on Module 1 for now.

Ready?

Lesson 1 — Agents, environments & the RL loop

Take your time. There is no deadline — only the goal of actually understanding each idea before moving on.