Module 1 — RL foundations & MDPs

Welcome — start here

RL vocabulary, how to read lessons, Module 1 roadmap, and what to install before the bandit project.

~30 min read + exercises

Before we begin

If you are new to reinforcement learning, you are in the right place. This course is standalone — you do not need the AI track first, though comfort with basic Python and (later) neural networks helps.

This page answers questions beginners actually have:

  • What do agent, environment, reward, and policy mean?
  • How is RL different from supervised machine learning?
  • What will Module 1 teach, and what do I install before the project?
  • How should I read each lesson so it sticks?

No background in RL or optimal control is assumed. Module 1 builds intuition first, then ends with a small multi-armed bandit project so that you experience exploration and regret first-hand rather than only reading about them.


Key concepts (plain English)

Agent — The learner or decision-maker: a robot, game player, recommender, or trading strategy. It chooses actions.

Environment — Everything the agent interacts with: physics, users, market, game rules. It returns next state and reward.

State — A summary of “where we are now” — enough that the future does not depend on the full history (the Markov property). If your state is incomplete, learning gets harder.

Action — What the agent can do: move left, buy, throttle, click. Can be discrete (finite choices) or continuous (a real number or vector).

Reward — A scalar signal saying how good the last step was. Not always “win/lose” — often shaped from many small signals. The agent’s job is to maximize cumulative reward over time.

Policy — The agent’s strategy: mapping states to actions (deterministic or stochastic). Written π(a|s) = probability of action a in state s.

Value function — How good it is to be in a state (or to take an action in a state), measured as expected future return. Bellman equations connect values across time steps.

Exploration vs exploitation — Try new actions to discover better rewards, or use what already works? Every RL algorithm handles this tension differently.
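To make the exploration vs exploitation trade-off concrete, here is a minimal sketch of ε-greedy action selection on a three-armed bandit. It is an illustration, not the project's reference solution: the function names, the arm means, the Gaussian reward noise, and the default ε = 0.1 are all assumptions chosen for the example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon pick a random arm (explore),
    otherwise pick the arm with the highest estimated reward (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda arm: estimates[arm])


def run_bandit(true_means, epsilon=0.1, steps=2000, seed=0):
    """Hypothetical helper: play a Gaussian bandit and track running averages."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # current guess of each arm's mean reward
    counts = [0] * len(true_means)       # how often each arm was pulled
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)  # assumed noise: std dev 1.0
        counts[arm] += 1
        # Incremental mean update: new = old + (sample - old) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Assumed arm means, unknown to the agent; it only sees sampled rewards.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
print("pull counts:", counts)
print("estimates:", [round(e, 2) for e in estimates])
```

Setting epsilon to 0 makes the agent purely greedy, and setting it to 1 makes it purely random. Re-running with different values is a quick way to see how the pull counts shift between arms.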

Idea            | Supervised ML             | Reinforcement learning
Training signal | Correct label per example | Reward (often sparse, delayed)
Data            | Fixed dataset             | Interaction generates data
Goal            | Predict labels            | Maximize return over behavior

Figure

The agent–environment loop

[Diagram: the agent (policy π) sends action a to the environment; the environment returns next state s′ and reward r to the agent.]
Observe state → choose action → receive reward and next state → repeat.
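The loop above can be sketched in a few lines of plain Python. The environment here, CoinFlipEnv, is a hypothetical toy written for this page and is not part of Gymnasium. Its reset/step interface, the coin bias, and the episode length are assumptions chosen to keep the example self-contained.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.7, episode_length=10, seed=0):
        self.p_heads = p_heads
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.t = 0
        return self.t  # state is simply the current step index

    def step(self, action):
        """Apply an action (1 = guess heads, 0 = guess tails).
        Returns (next_state, reward, done)."""
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def always_heads_policy(state):
    """A deterministic policy: ignore the state and always guess heads."""
    return 1


def run_episode(env, policy):
    """Observe state -> choose action -> receive reward and next state -> repeat."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


env = CoinFlipEnv()
total = run_episode(env, always_heads_policy)
print("total reward over one episode:", total)
```

Gymnasium environments follow the same observe, act, receive pattern, although their step method returns five values (observation, reward, terminated, truncated, info) rather than the three used in this sketch.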

Figure

Module 1 at a glance

[Diagram: Module 1 lesson flow: 1. Welcome → 2. Lesson 1 → 3. Lesson 2 → 4. Lesson 3 → 5. Lesson 4 → 6. Quiz → 7. Project]
Welcome, four core lessons, quiz, then hands-on bandit project.

What is this course?

Deep Reinforcement Learning walks from MDPs and tabular Q-learning through DQN, policy gradients, PPO, SAC, model-based RL, and production deployment — with quizzes and projects in every module.

Module 1 in one sentence

You will understand what RL optimizes, how MDPs model problems, and how value functions and Bellman equations underpin every algorithm that follows.

Lesson  | Topic
1       | Agents, environments, the RL loop
2       | Markov decision processes
3       | Returns, discounting, episodes
4       | Bellman equations & value functions
Quiz    | 20 MCQs with review links
Project | Multi-armed bandit (ε-greedy & UCB1)

Who is this for?

Good fit if you:

  • Want to understand how AlphaGo, game-playing agents, or robot policies are trained — not just use them as black boxes.
  • Know basic Python and are willing to use Gymnasium for environments.
  • Prefer slow, descriptive lessons over bullet-only summaries.

Helpful but not required:

  • The AI course Modules 1–4 (gradients, neural nets) before Module 4 (DQN) of this track.
  • The Robotics Foundations track for Module 8 (continuous control).

How to read each lesson

  1. Read Before we begin and What you will learn.
  2. Answer checkpoint questions before peeking at answers.
  3. Work through numeric examples with paper or a calculator.
  4. Use What's next only when the current lesson feels solid.

Progress saves in this browser when you open a lesson.


What to install before the project

Lessons 1–4 and the quiz are reading and thinking. Only the project requires code.

  • Python 3.10+ (python.org/downloads)
  • pip install numpy matplotlib gymnasium
  • Any editor (VS Code, Cursor, etc.)
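After installing, a quick sanity check along these lines can confirm that the setup is ready. This is a minimal sketch: the helper name missing_packages is made up for this example, and the script only checks that the packages can be found, not that specific versions are installed.

```python
import importlib.util
import sys

# Packages named in the install list above.
REQUIRED = ["numpy", "matplotlib", "gymnasium"]


def missing_packages(names):
    """Hypothetical helper: return the names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


python_ok = sys.version_info >= (3, 10)
missing = missing_packages(REQUIRED)

print("Python 3.10+:", "yes" if python_ok else "no, found " + sys.version.split()[0])
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required packages found.")
```

If anything is reported missing, re-run the pip command in the same Python environment that your editor uses.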

From Module 4 onward you will also use PyTorch. From Module 6, Stable-Baselines3 is recommended for PPO/SAC labs.


Full course roadmap

  1. RL foundations & MDPs — you are here
  2. Tabular methods (DP, MC, TD, Q-learning)
  3. Function approximation
  4. Deep Q-networks (DQN)
  5. Policy gradients (REINFORCE)
  6. Actor–critic & PPO
  7. Model-based RL & planning
  8. Continuous control & robotics RL
  9. Production & advanced topics

Focus on Module 1 for now.


Ready?

Lesson 1 — Agents, environments & the RL loop

Take your time. There is no deadline — only the goal of actually understanding each idea before moving on.