Agents, environments & the RL loop
Before we begin
Reinforcement learning is about learning from interaction, not from a fixed labeled dataset. An agent takes actions in an environment, receives rewards, and updates its behavior over time. There is no teacher saying “the correct action at step 47 was left” — only scalar feedback that may arrive late or be sparse.
This loop is the foundation of game-playing agents, robotics, recommendation systems, and large-scale alignment pipelines. Before any equation, you should be able to draw the loop and name what travels on each arrow.
State — what the agent observes about the world.
Action — what the agent does.
Reward — immediate feedback signal.
Policy — rule for choosing actions from states.
What you will learn
- Define state, action, reward, and policy in your own words.
- Draw the agent–environment loop and label each message.
- Contrast RL with supervised learning (no per-step correct labels).
- Recognize exploration vs exploitation in everyday and algorithmic decisions.
- Map Gymnasium API calls (`reset`, `step`) to the theoretical loop.
The agent–environment loop
At each discrete time step t:
- The environment exposes state sₜ (or observation oₜ).
- The agent selects action aₜ using its policy π.
- The environment returns reward rₜ₊₁, next state sₜ₊₁, and a done flag.
| Message | Direction | Typical content |
|---|---|---|
| State / observation | Environment → Agent | Vector, image, game board |
| Action | Agent → Environment | Discrete index or continuous vector |
| Reward | Environment → Agent | Scalar (can be negative) |
| Done | Environment → Agent | Episode ended? |
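To make the arrows concrete, here is a minimal sketch of the loop in plain Python. The `CorridorEnv` class, its reward values, and `random_policy` are hypothetical, invented only for illustration; they are not part of any library.

```python
import random
from dataclasses import dataclass


@dataclass
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""
    length: int = 5
    position: int = 0

    def reset(self) -> int:
        self.position = 0
        return self.position  # initial state s_0

    def step(self, action: int) -> tuple[int, float, bool]:
        # action 1 = move right, anything else = move left (clipped at the walls)
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # assumed values: step cost, bonus at the goal
        return self.position, reward, done


def random_policy(state: int, rng: random.Random) -> int:
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([0, 1])


def run_episode(env: CorridorEnv, rng: random.Random, max_steps: int = 1000) -> tuple[float, int]:
    state = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = random_policy(state, rng)      # agent -> environment
        state, reward, done = env.step(action)  # environment -> agent
        total_reward += reward
        steps += 1
    return total_reward, steps
```

Each pass through the `while` body is one time step t: a state goes in, an action comes out, and the environment answers with a reward, the next state, and the done flag.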
Worked example: thermostat agent
| RL concept | Thermostat |
|---|---|
| State | Current room temperature, outdoor temp, time of day |
| Action | Heat on / heat off / fan only |
| Reward | Negative of energy cost, minus a comfort penalty when the room is too cold |
| Policy | Rule mapping readings → HVAC setting |
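The thermostat table can be sketched in code. The energy costs, the 19 °C comfort threshold, and the 20 °C switching rule below are assumptions chosen for illustration, not values from a real controller.

```python
from dataclasses import dataclass

ACTIONS = ("heat_on", "heat_off", "fan_only")


@dataclass(frozen=True)
class ThermostatState:
    room_temp_c: float
    outdoor_temp_c: float
    hour: int


def reward(state: ThermostatState, action: str) -> float:
    # Assumed energy cost per step, in arbitrary units.
    energy_cost = {"heat_on": 1.0, "heat_off": 0.0, "fan_only": 0.2}[action]
    # Assumed comfort threshold of 19 °C; the penalty grows with how cold the room is.
    comfort_penalty = max(0.0, 19.0 - state.room_temp_c)
    return -energy_cost - comfort_penalty


def threshold_policy(state: ThermostatState) -> str:
    """Hand-written policy: heat whenever the room is below 20 °C."""
    return "heat_on" if state.room_temp_c < 20.0 else "heat_off"
```

A learned policy would replace `threshold_policy` while the state, action set, and reward stay the same.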
Checkpoint: In a chess engine trained with self-play, what is the state? The action? The reward?
Answer
State: board position (pieces, castling rights, en passant, side to move). Action: legal move from the move list. Reward: often 0 per move, +1 win, −1 loss, 0 draw — or shaped rewards in some curricula.
Mapping the loop to Gymnasium
Gymnasium standardizes the interface so algorithms swap environments without rewriting the training loop.
```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
total_reward = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy; replace with yours
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
print("Episode return:", total_reward)
env.close()
```

| Gymnasium return | RL meaning |
|---|---|
| `obs` | State / observation (s₀ from `reset`, sₜ₊₁ from `step`) |
| `reward` | Scalar rₜ₊₁ |
| `terminated` | True terminal state (pole fell) |
| `truncated` | Time limit hit, not necessarily “failure” |
| `info` | Debug extras (not used for learning by default) |
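Because `terminated` and `truncated` mean different things, it can help to record them separately. The helper below is a sketch, not part of Gymnasium: `run_episode` and its returned dictionary keys are hypothetical names, and it assumes only that the environment follows the `reset`/`step` signature used in the loop above.

```python
from typing import Any, Callable, Optional


def run_episode(env: Any, policy: Callable[[Any], Any], seed: Optional[int] = None) -> dict:
    """Run one episode against any env exposing Gymnasium-style reset/step."""
    obs, info = env.reset(seed=seed)
    episode_return, length = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += float(reward)
        length += 1
    # Keep the two end conditions separate so logs can tell them apart.
    return {
        "return": episode_return,
        "length": length,
        "terminated": terminated,
        "truncated": truncated,
    }
```

With Gymnasium installed, calling `run_episode(env, lambda obs: env.action_space.sample(), seed=42)` on the CartPole environment should reproduce the random-policy loop while also reporting how the episode ended.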
RL vs supervised learning
| Aspect | Supervised learning | Reinforcement learning |
|---|---|---|
| Training signal | Correct label per example | Reward (may be delayed) |
| Data | Fixed dataset | Generated by interaction |
| Goal | Predict labels | Maximize cumulative reward |
| Credit assignment | Clear per sample | Which action caused later reward? |
A spam classifier learns from emails labeled spam/ham. An RL agent learns from consequences — which button led to more clicks, which move led to checkmate.
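The "cumulative reward" in the goal row can be written compactly. As a sketch, assume the episode ends at step T and rewards are simply summed with no discounting; the symbols G and T are introduced here for illustration only.

```latex
G_t = r_{t+1} + r_{t+2} + \dots + r_T = \sum_{k=t+1}^{T} r_k
```

Under that assumption, the `total_reward` printed by the CartPole loop is G₀, the return from the first step of the episode.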
Exploration vs exploitation
The agent must exploit what it knows (pick the best-looking action) and explore (try alternatives that might be better). Pure exploitation can lock onto a suboptimal action because better ones are never tried; pure exploration keeps gathering information but never uses it to earn more reward.
| Strategy | Idea | When it shines |
|---|---|---|
| ε-greedy | Random action with probability ε, greedy action otherwise | Simple bandits and Q-learning |
| Softmax / Boltzmann | Sample with probability proportional to exp(value / temperature) | Stochastic policies |
| Optimism (UCB) | Prefer arms whose estimate plus uncertainty bonus is highest | Multi-armed bandits |
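The three strategies in the table can be sketched as action-selection functions over a list of estimated values. The function names, the `temperature` parameter, and the constant `c` are assumptions for illustration; the UCB variant follows the common UCB1 form with bonus sqrt(c · ln(total) / n).

```python
import math
import random
from typing import Sequence


def epsilon_greedy(values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """With probability epsilon pick uniformly at random, else pick the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])


def softmax_probs(values: Sequence[float], temperature: float = 1.0) -> list[float]:
    """Boltzmann probabilities, proportional to exp(value / temperature)."""
    highest = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - highest) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(values: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Sample an action index according to the Boltzmann probabilities."""
    probs = softmax_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


def ucb1(values: Sequence[float], counts: Sequence[int], c: float = 2.0) -> int:
    """Pick the arm with the highest estimate plus uncertainty bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every arm once before comparing bounds
    total = sum(counts)
    scores = [v + math.sqrt(c * math.log(total) / n) for v, n in zip(values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

Lower `temperature` makes the softmax choice closer to greedy; in `ucb1` the bonus shrinks as an arm's count grows, so rarely tried arms get revisited.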
Numeric sketch: Two restaurant options. You know A scores 8/10 from 20 visits; B is unknown. Exploitation picks A; exploration tries B once — maybe it is 9/10.
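The restaurant sketch can be played out with a running average. The helper names below are hypothetical, and the score of 9 for B is the hoped-for outcome from the example, not a guaranteed one.

```python
def update_mean(mean: float, count: int, new_score: float) -> tuple[float, int]:
    """Incremental average: fold one new observation into a running mean."""
    count += 1
    mean += (new_score - mean) / count
    return mean, count


def greedy_choice(estimates: dict) -> str:
    """Exploit: pick the best mean among options that have been tried at least once."""
    visited = {name: mean for name, (mean, n) in estimates.items() if n > 0}
    return max(visited, key=visited.get)


# (mean score, number of visits) for each restaurant.
estimates = {"A": (8.0, 20), "B": (0.0, 0)}
greedy_before = greedy_choice(estimates)  # only A has been visited

# Explore: try B once and record the assumed score of 9.
estimates["B"] = update_mean(*estimates["B"], new_score=9.0)
greedy_after = greedy_choice(estimates)   # B now looks better
```

One visit is a noisy estimate, which is why strategies such as UCB keep track of visit counts as well as means.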
Common mistakes
- Calling the observation a state when it omits hidden information (for example, a position-only observation with no velocity); this breaks the Markov assumption later.
- Treating truncated the same as terminated when logging “success rate.”
- Assuming rewards must be positive — negative step costs are normal.
- Confusing policy (behavior) with value function (how good states are) — policies act; values evaluate.
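For the first mistake in the list, one common workaround is to rebuild the missing information from consecutive observations. The sketch below assumes a scalar position and a known time step `dt`; the function name is hypothetical.

```python
def augment_with_velocity(prev_position: float, position: float, dt: float) -> tuple[float, float]:
    """Return (position, estimated velocity) using a finite difference."""
    velocity = (position - prev_position) / dt
    return position, velocity
```

Feeding the agent this pair instead of the position alone brings the observation closer to a full state, at the cost of needing one previous observation.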
You now have the vocabulary every RL paper uses: agent, environment, state, action, reward, policy, and the interaction loop. The next lesson formalizes this as a Markov decision process so we can write equations and compute optimal behavior.