Agents, environments & the RL loop

Before we begin

Reinforcement learning is about learning from interaction, not from a fixed labeled dataset. An agent takes actions in an environment, receives rewards, and updates its behavior over time. There is no teacher saying “the correct action at step 47 was left” — only scalar feedback that may arrive late or be sparse.

This loop is the foundation of game-playing agents, robotics, recommendation systems, and large-scale alignment pipelines. Before any equation, you should be able to draw the loop and name what travels on each arrow.

State — what the agent observes about the world.
Action — what the agent does.
Reward — immediate feedback signal.
Policy — rule for choosing actions from states.

What you will learn

Define state, action, reward, and policy in your own words.
Draw the agent–environment loop and label each message.
Contrast RL with supervised learning (no per-step correct labels).
Recognize exploration vs exploitation in everyday and algorithmic decisions.
Map Gymnasium API calls (reset, step) to the theoretical loop.

The agent–environment loop

At each discrete time step t:

The environment exposes state sₜ (or observation oₜ).
The agent selects action aₜ using its policy π.
The environment returns reward rₜ₊₁, next state sₜ₊₁, and a done flag.

Message	Direction	Typical content
State / observation	Environment → Agent	Vector, image, game board
Action	Agent → Environment	Discrete index or continuous vector
Reward	Environment → Agent	Scalar (can be negative)
Done	Environment → Agent	Episode ended?
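
The three steps and four messages above can be sketched in plain Python with no RL library. `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this illustration; only the shape of the loop (state out, action in, then reward, next state, and done flag back) comes from the description above.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state s_0

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped at the left wall)
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
        return self.position, reward, done


def random_policy(state, rng):
    """Policy pi: maps a state to an action (this one ignores the state)."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state, rng)                  # agent -> environment
        next_state, reward, done = env.step(action)  # environment -> agent
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


rng = random.Random(0)
trajectory = run_episode(CorridorEnv(), random_policy, rng)
episode_return = sum(reward for _, _, reward, _ in trajectory)
print(f"steps: {len(trajectory)}, return: {episode_return:.1f}")
```

Each tuple in `trajectory` is one trip around the loop: (sₜ, aₜ, rₜ₊₁, sₜ₊₁). Swapping `random_policy` for a function that always returns 1 reaches the goal in four steps.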

Worked example: thermostat agent

RL concept	Thermostat
State	Current room temperature, outdoor temp, time of day
Action	Heat on / heat off / fan only
Reward	−(energy cost) − (comfort penalty when the room is too cold)
Policy	Rule mapping readings → HVAC setting
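
As a rough sketch, the thermostat table might translate into code like this. All numbers (the 20 °C comfort threshold, the energy costs, the penalty weight) and all names are assumptions made up for illustration, not values from a real HVAC system.

```python
from dataclasses import dataclass

ACTIONS = ("heat_on", "heat_off", "fan_only")

# Assumed numbers, chosen only to make the example concrete.
ENERGY_COST = {"heat_on": 1.0, "heat_off": 0.0, "fan_only": 0.2}
COMFORT_MIN_C = 20.0      # below this the room counts as "too cold"
PENALTY_PER_DEGREE = 0.5  # comfort penalty per degree below the minimum


@dataclass(frozen=True)
class ThermostatState:
    room_temp_c: float
    outdoor_temp_c: float
    hour_of_day: int


def reward(state: ThermostatState, action: str) -> float:
    """Negative energy cost, minus a comfort penalty if the room is too cold."""
    shortfall = max(0.0, COMFORT_MIN_C - state.room_temp_c)
    return -ENERGY_COST[action] - PENALTY_PER_DEGREE * shortfall


def rule_policy(state: ThermostatState) -> str:
    """A hand-written policy: a rule mapping readings to an HVAC setting."""
    if state.room_temp_c < COMFORT_MIN_C:
        return "heat_on"
    return "heat_off"


cold_morning = ThermostatState(room_temp_c=17.0, outdoor_temp_c=2.0, hour_of_day=6)
action = rule_policy(cold_morning)
print(action, reward(cold_morning, action))  # heat_on -2.5
```

An RL agent would learn a mapping like `rule_policy` from the rewards it receives instead of having the rule written by hand.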

Checkpoint: In a chess engine trained with self-play, what is the state? The action? The reward?

Answer

State: board position (pieces, castling rights, en passant, side to move). Action: legal move from the move list. Reward: often 0 per move, +1 win, −1 loss, 0 draw — or shaped rewards in some curricula.

Mapping the loop to Gymnasium

Gymnasium standardizes the interface so algorithms swap environments without rewriting the training loop.

python

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)  # initial observation s₀

total_reward = 0.0
terminated = truncated = False

while not (terminated or truncated):
    action = env.action_space.sample()  # random policy; replace with yours
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # undiscounted episode return

env.close()  # release any resources held by the environment
print("Episode return:", total_reward)

Gymnasium return	RL meaning
`obs`	State / observation: s₀ from `reset`, sₜ₊₁ from `step`
`reward`	Scalar rₜ₊₁
`terminated`	True terminal state (e.g., pole tipped too far or cart left the track)
`truncated`	Time limit hit — not necessarily “failure”
`info`	Debug extras (not used for learning by default)
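
One way to keep the two end-of-episode flags apart when logging is to return them alongside the episode return. In the sketch below, `CountdownEnv` is a hypothetical stub that mimics the shapes of Gymnasium's `reset` and `step` returns so the example runs without Gymnasium installed; `run_episode` is also an invented helper, written on the assumption that the environment follows the same five-value `step` convention.

```python
import random


class CountdownEnv:
    """Hypothetical stub that follows the Gymnasium reset/step return shapes."""

    def __init__(self, fail_prob=0.05, time_limit=50):
        self.fail_prob = fail_prob
        self.time_limit = time_limit
        self.rng = random.Random()
        self.t = 0

    def reset(self, seed=None):
        if seed is not None:
            self.rng = random.Random(seed)
        self.t = 0
        return self.t, {}  # (obs, info)

    def step(self, action):
        self.t += 1
        terminated = self.rng.random() < self.fail_prob  # true terminal state
        truncated = (not terminated) and self.t >= self.time_limit  # time limit
        return self.t, 1.0, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Run one episode and report how it ended, not just that it ended."""
    obs, info = env.reset(seed=seed)
    total_reward, terminated, truncated = 0.0, False, False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
    return {"return": total_reward, "terminated": terminated, "truncated": truncated}


env = CountdownEnv()
results = [run_episode(env, policy=lambda obs: 0, seed=s) for s in range(100)]
n_terminated = sum(r["terminated"] for r in results)
n_truncated = sum(r["truncated"] for r in results)
print(f"terminated: {n_terminated}, truncated: {n_truncated}")
```

Counting the two outcomes separately makes it visible when an episode ended only because the time limit ran out.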

RL vs supervised learning

Aspect	Supervised learning	Reinforcement learning
Training signal	Correct label per example	Reward (may be delayed)
Data	Fixed dataset	Generated by interaction
Goal	Predict labels	Maximize cumulative reward
Credit assignment	Clear per sample	Which action caused later reward?

A spam classifier learns from emails labeled spam/ham. An RL agent learns from consequences — which button led to more clicks, which move led to checkmate.

Exploration vs exploitation

The agent must exploit what it knows (pick the best-looking action) and explore (try alternatives that might be better). Pure exploitation can lock onto a suboptimal action it happened to rate highly early on; pure exploration keeps gathering information but never cashes in on it.

Strategy	Idea	When it shines
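ε-greedy	Random action with probability ε, greedy action otherwise	Simple bandits and Q-learning
Softmax / Boltzmann	Sample with probability proportional to exp(estimated value / temperature)	Stochastic policies
Optimism (UCB)	Add an uncertainty bonus to each estimate so rarely tried arms look attractive	Multi-armed bandits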
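
The three strategies in the table can each be written as a short action-selection function. This is a minimal sketch using only the standard library; the function names, the temperature parameter, and the bonus constant `c` are assumptions for illustration, and the UCB rule shown is the common UCB1-style form rather than the only variant.

```python
import math
import random


def epsilon_greedy(values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])


def softmax_probs(values, temperature=1.0):
    """Boltzmann distribution: p(a) proportional to exp(values[a] / temperature)."""
    top = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(values, temperature, rng):
    probs = softmax_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


def ucb1_action(values, counts, c=2.0):
    """UCB1-style rule: value estimate plus a bonus that shrinks with visits."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every arm once before trusting the estimates
    total = sum(counts)
    scores = [v + math.sqrt(c * math.log(total) / n) for v, n in zip(values, counts)]
    return max(range(len(values)), key=lambda a: scores[a])


rng = random.Random(0)
values = [1.0, 2.0, 1.5]  # made-up value estimates for three actions
print(epsilon_greedy(values, epsilon=0.0, rng=rng))  # epsilon = 0 is purely greedy
print(softmax_probs(values, temperature=1.0))
print(ucb1_action(values, counts=[10, 10, 0]))       # unvisited arm goes first
```

Lowering `temperature` pushes the softmax toward the greedy choice, while raising it moves the distribution toward uniform random.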

Numeric sketch: Two restaurant options. You know A scores 8/10 from 20 visits; B is unknown. Exploitation picks A; exploration tries B once — maybe it is 9/10.
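
The restaurant scenario can be simulated as a two-armed bandit. In this sketch B's true average of 9/10, the rating noise, and the number of visits are assumptions chosen for the demo; only A's record (8/10 over 20 visits) comes from the text above.

```python
import random

TRUE_MEAN = {"A": 8.0, "B": 9.0}  # B's true score is an assumption for the demo


def simulate(epsilon, visits=2000, seed=0):
    rng = random.Random(seed)
    # Prior knowledge from the text: A averages 8/10 over 20 visits, B is untried.
    estimate = {"A": 8.0, "B": 0.0}
    count = {"A": 20, "B": 0}
    for _ in range(visits):
        if rng.random() < epsilon:
            choice = rng.choice(["A", "B"])           # explore
        else:
            choice = max(estimate, key=estimate.get)  # exploit
        score = rng.gauss(TRUE_MEAN[choice], 1.0)     # noisy meal rating
        count[choice] += 1
        # Incremental running mean of the ratings seen so far.
        estimate[choice] += (score - estimate[choice]) / count[choice]
    return estimate, count


greedy_est, greedy_count = simulate(epsilon=0.0)
explore_est, explore_count = simulate(epsilon=0.1)
print("pure exploitation:", greedy_count)  # B is never visited
print("epsilon = 0.1:    ", explore_count)
```

With ε = 0 the diner never learns that B is better, because the untried option never looks best. With ε = 0.1 the occasional random visit is enough to find B and then favor it.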

Common mistakes

Calling the observation a state when it omits hidden information (for example, a position-only observation that leaves out velocity); this breaks the Markov assumption later.
Treating truncated the same as terminated when logging “success rate.”
Assuming rewards must be positive — negative step costs are normal.
Confusing policy (behavior) with value function (how good states are) — policies act; values evaluate.
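
The policy/value distinction in the last item can be made concrete with a tiny table. The numbers and state names below are made up, and the sketch assumes the table holds the action values of the greedy policy itself.

```python
# Made-up action-value estimates for a two-state toy problem.
q_values = {
    "cold_room": {"heat_on": 2.0, "heat_off": -3.0},
    "warm_room": {"heat_on": -1.0, "heat_off": 0.5},
}


def greedy_policy(state):
    """Policy: state -> action. It decides what to do."""
    actions = q_values[state]
    return max(actions, key=actions.get)


def state_value(state):
    """Value: state -> number. It scores the state under greedy_policy."""
    return q_values[state][greedy_policy(state)]


for state in q_values:
    print(state, "->", greedy_policy(state), "| value:", state_value(state))
```

The policy returns an action and the value function returns a number. A learning algorithm may store either one, or both.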

You now have the vocabulary every RL paper uses: agent, environment, state, action, reward, policy, and the interaction loop. The next lesson formalizes this as a Markov decision process so we can write equations and compute optimal behavior.
