
Module 1 — RL foundations & MDPs

Agents, environments & the RL loop

State, action, reward, policy; exploration vs exploitation; contrast with supervised learning.

~55 min read + exercises

Before we begin

Reinforcement learning is about learning from interaction, not from a fixed labeled dataset. An agent takes actions in an environment, receives rewards, and updates its behavior over time. There is no teacher saying “the correct action at step 47 was left” — only scalar feedback that may arrive late or be sparse.

This loop is the foundation of game-playing agents, robotics, recommendation systems, and large-scale alignment pipelines. Before any equation, you should be able to draw the loop and name what travels on each arrow.

State — what the agent observes about the world.
Action — what the agent does.
Reward — immediate feedback signal.
Policy — rule for choosing actions from states.


What you will learn

  • Define state, action, reward, and policy in your own words.
  • Draw the agent–environment loop and label each message.
  • Contrast RL with supervised learning (no per-step correct labels).
  • Recognize exploration vs exploitation in everyday and algorithmic decisions.
  • Map Gymnasium API calls (reset, step) to the theoretical loop.

The agent–environment loop

At each discrete time step t:

  1. The environment exposes state sₜ (or observation oₜ).
  2. The agent selects action aₜ using its policy π.
  3. The environment returns reward rₜ₊₁, next state sₜ₊₁, and a done flag.

| Message | Direction | Typical content |
|---|---|---|
| State / observation | Environment → Agent | Vector, image, game board |
| Action | Agent → Environment | Discrete index or continuous vector |
| Reward | Environment → Agent | Scalar (can be negative) |
| Done | Environment → Agent | Episode ended? |
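
To make the three steps concrete, here is a minimal sketch of the loop in plain Python. `CorridorEnv`, `always_right`, and `rollout` are hypothetical names invented for this illustration, and the reward values (a step cost of -1, a goal bonus of +10) are assumptions rather than part of any library.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk along cells 0..4; cell 4 is the goal."""

    GOAL = 4

    def reset(self):
        self.position = 0
        return self.position  # initial state s_0

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.position = max(0, min(self.GOAL, self.position + action))
        done = self.position == self.GOAL
        reward = 10.0 if done else -1.0  # step cost until the goal is reached
        return self.position, reward, done  # s_{t+1}, r_{t+1}, done flag


def always_right(state):
    """A fixed policy: ignore the state and always move right."""
    return +1


def rollout(env, policy, max_steps=100):
    """Run one episode and record (state, action, reward) at every step."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # agent -> environment
        next_state, reward, done = env.step(action)  # environment -> agent
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


trajectory = rollout(CorridorEnv(), always_right)
print(trajectory)
```

Each tuple in `trajectory` corresponds to one pass around the loop: the state the agent saw, the action it chose, and the reward that came back.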

Worked example: thermostat agent

| RL concept | Thermostat |
|---|---|
| State | Current room temperature, outdoor temp, time of day |
| Action | Heat on / heat off / fan only |
| Reward | Negative energy cost minus comfort penalty if too cold |
| Policy | Rule mapping readings → HVAC setting |
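
One possible way to write the thermostat's reward and a simple rule-based policy is sketched below. The function names, the 20 °C comfort threshold, and the cost and penalty weights are all illustrative assumptions. For brevity the sketch looks only at room temperature and leaves out the outdoor temperature, time of day, and the fan-only action.

```python
def thermostat_reward(room_temp_c, heating_on, comfort_min_c=20.0,
                      energy_cost=1.0, cold_penalty_per_degree=2.0):
    """Reward = -(energy cost) - (comfort penalty if the room is too cold).

    All numeric defaults are assumed values chosen for illustration.
    """
    energy = energy_cost if heating_on else 0.0
    shortfall = max(0.0, comfort_min_c - room_temp_c)  # degrees below comfort
    return -energy - cold_penalty_per_degree * shortfall


def threshold_policy(room_temp_c, setpoint_c=20.0):
    """Rule-based policy: heat whenever the room is below the setpoint."""
    return "heat_on" if room_temp_c < setpoint_c else "heat_off"


print(threshold_policy(18.0), thermostat_reward(18.0, heating_on=False))
print(threshold_policy(21.0), thermostat_reward(21.0, heating_on=True))
```

Note that every reward here is zero or negative: the agent's goal is to lose as little as possible, which is a perfectly valid objective.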

Checkpoint: In a chess engine trained with self-play, what is the state? The action? The reward?

Answer

State: board position (pieces, castling rights, en passant, side to move). Action: legal move from the move list. Reward: often 0 per move, +1 win, −1 loss, 0 draw — or shaped rewards in some curricula.
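
The sparse reward scheme in the answer can be sketched as a small lookup. `terminal_reward` and the outcome labels are hypothetical names for this illustration, and the 40-move game is made up.

```python
def terminal_reward(outcome):
    """Sparse reward from the learning player's perspective (assumed labels)."""
    rewards = {"win": 1.0, "loss": -1.0, "draw": 0.0, "ongoing": 0.0}
    return rewards[outcome]


# A made-up 40-move game: 39 moves with no feedback, then a win.
move_outcomes = ["ongoing"] * 39 + ["win"]
rewards = [terminal_reward(o) for o in move_outcomes]
print(sum(rewards), rewards.count(0.0))
```

Only the final move carries a nonzero reward, which is why deciding which earlier moves deserve the credit is hard.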


Mapping the loop to Gymnasium

Gymnasium standardizes the interface so that algorithms can swap environments without rewriting the training loop.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

total_reward = 0.0
terminated = truncated = False

while not (terminated or truncated):
    action = env.action_space.sample()  # random policy — replace with yours
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

print("Episode return:", total_reward)
```
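
| Gymnasium return | RL meaning |
|---|---|
| obs | State / observation: s₀ from reset, sₜ₊₁ from step |
| reward | Scalar rₜ₊₁ |
| terminated | True terminal state (in CartPole, the pole fell or the cart left the track) |
| truncated | Time limit hit, not necessarily “failure” |
| info | Debug extras (not used for learning by default) |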
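
To swap the random policy for your own, the episode loop can be wrapped in a function that takes the policy as an argument. The sketch below assumes only the `reset` and `step` return shapes shown above. `CountdownEnv` is a hypothetical stand-in that mimics those shapes so the example runs without Gymnasium installed, and `run_episode` is likewise an illustrative name.

```python
class CountdownEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step return shapes."""

    def __init__(self, fail_on_action=None, time_limit=3):
        self.fail_on_action = fail_on_action
        self.time_limit = time_limit
        self.steps = 0

    def reset(self, seed=None):
        self.steps = 0
        return 0, {}  # (obs, info)

    def step(self, action):
        self.steps += 1
        terminated = action == self.fail_on_action  # "real" end of the episode
        truncated = (not terminated) and self.steps >= self.time_limit
        return self.steps, 1.0, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Run one episode with policy(obs) -> action; return the total and end flags."""
    obs, info = env.reset(seed=seed)
    total_reward = 0.0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
    return total_reward, terminated, truncated


print(run_episode(CountdownEnv(), policy=lambda obs: 0))
print(run_episode(CountdownEnv(fail_on_action=1), policy=lambda obs: 1))
```

Returning both flags alongside the episode return keeps the distinction between a true terminal state and a time limit available to whatever logs the result.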

RL vs supervised learning

| Aspect | Supervised learning | Reinforcement learning |
|---|---|---|
| Training signal | Correct label per example | Reward (may be delayed) |
| Data | Fixed dataset | Generated by interaction |
| Goal | Predict labels | Maximize cumulative reward |
| Credit assignment | Clear per sample | Which action caused later reward? |

A spam classifier learns from emails labeled spam/ham. An RL agent learns from consequences — which button led to more clicks, which move led to checkmate.
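
A small sketch can show why credit assignment is harder than it looks. It takes the simplest reading of "cumulative reward", a plain undiscounted sum from each step to the end of the episode. `returns_to_go` is a hypothetical helper name and the reward sequence is made up.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end of the episode."""
    running = 0.0
    totals = []
    for r in reversed(rewards):  # accumulate from the last step backwards
        running += r
        totals.append(running)
    totals.reverse()
    return totals


# Made-up episode: three moves with no feedback, then a reward of 1.
print(returns_to_go([0.0, 0.0, 0.0, 1.0]))
```

Every step ends up with the same cumulative reward, so the numbers alone do not say which of the four actions actually earned it. A supervised learner would instead get a separate correct label for each step.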


Exploration vs exploitation

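The agent must exploit what it knows (pick the best-looking action) and explore (try alternatives that might be better). Pure exploitation can lock onto a suboptimal action because it never gathers the evidence that would reveal a better one; pure exploration gathers that evidence but never uses it to collect more reward.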

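| Strategy | Idea | When it shines |
|---|---|---|
| ε-greedy | Random action with probability ε, greedy action otherwise | Simple bandits and Q-learning |
| Softmax / Boltzmann | Sample with probability proportional to exp(estimated value / temperature) | Stochastic policies |
| Optimism (UCB) | Prefer arms whose value estimates are still uncertain | Multi-armed bandits |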
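
The first two strategies fit in a few lines each. The sketch below uses only the Python standard library. The function names and the `temperature` parameter are choices made for this illustration, and `values` stands for whatever per-action value estimates the agent currently holds.

```python
import math
import random


def epsilon_greedy(values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-looking one."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))                    # explore
    return max(range(len(values)), key=lambda a: values[a])  # exploit


def softmax_probs(values, temperature=1.0):
    """Action probabilities proportional to exp(value / temperature)."""
    highest = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - highest) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_softmax(values, temperature, rng):
    """Draw one action index according to the softmax probabilities."""
    probs = softmax_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs)[0]


rng = random.Random(0)
estimates = [1.0, 3.0, 2.0]
print(epsilon_greedy(estimates, epsilon=0.1, rng=rng))
print(softmax_probs(estimates, temperature=1.0))
```

A lower temperature pushes the softmax toward the greedy choice, while a higher one pushes it toward uniform random sampling.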

Numeric sketch: Two restaurant options. You know A scores 8/10 from 20 visits; B is unknown. Exploitation picks A; exploration tries B once — maybe it is 9/10.
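
The restaurant choice can also be scored with an optimism rule. The sketch below uses a UCB1-style bonus, the mean rating plus c·sqrt(ln(total visits) / visits). Applying that bonus directly to a 0–10 rating scale with c = √2 is an assumption made for illustration, and `ucb_score` and `pick` are hypothetical names.

```python
import math


def ucb_score(mean_rating, visits, total_visits, c=math.sqrt(2)):
    """Mean rating plus an uncertainty bonus that shrinks as visits grow."""
    if visits == 0:
        return math.inf  # never-tried options are maximally uncertain
    return mean_rating + c * math.sqrt(math.log(total_visits) / visits)


def pick(stats):
    """stats maps name -> (mean_rating, visits); return the highest-scoring name."""
    total = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda name: ucb_score(*stats[name], total))


# A has 20 visits averaging 8/10; B has never been tried.
print(pick({"A": (8.0, 20), "B": (0.0, 0)}))
# After one disappointing visit to B (5/10), the bonus no longer rescues it.
print(pick({"A": (8.0, 20), "B": (5.0, 1)}))
```

Under these assumptions the rule tries B first because nothing is known about it, then returns to A if B's single rating is poor, or stays with B if that rating is 9/10.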


Common mistakes

  • Calling the observation a state when it omits hidden information (for example, a position-only observation that leaves out velocity). This breaks the Markov assumption later.
  • Treating truncated the same as terminated when logging “success rate.”
  • Assuming rewards must be positive — negative step costs are normal.
  • Confusing policy (behavior) with value function (how good states are) — policies act; values evaluate.
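
For the truncated-versus-terminated mistake, one option is to tally the two endings separately before computing any success metric. The sketch below assumes each episode is logged as a `(terminated, truncated)` pair of flags. `summarize_episodes` is a hypothetical helper and the sample log is made up.

```python
def summarize_episodes(episodes):
    """Count how episodes ended; episodes is a list of (terminated, truncated) pairs."""
    n_terminated = sum(1 for terminated, _ in episodes if terminated)
    n_truncated = sum(1 for terminated, truncated in episodes
                      if truncated and not terminated)
    return {
        "episodes": len(episodes),
        "terminated": n_terminated,  # reached a true terminal state
        "truncated": n_truncated,    # cut off by the time limit
    }


# Made-up log: one episode hit a terminal state, two ran into the time limit.
log = [(True, False), (False, True), (False, True)]
print(summarize_episodes(log))
```

Whether a truncated episode counts as a success depends on the task. In a balancing task such as CartPole, surviving until the time limit is the good outcome, so lumping it in with termination would invert the metric.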

You now have the vocabulary every RL paper uses: agent, environment, state, action, reward, policy, and the interaction loop. The next lesson formalizes this as a Markov decision process so we can write equations and compute optimal behavior.

