Multi-agent RL basics

Before we begin

Real systems involve multiple decision-makers: traffic, games, warehouse robots, ad auctions. Multi-agent RL (MARL) studies interacting policies in cooperative, competitive, or mixed settings. Non-stationarity breaks single-agent assumptions: from one agent's view, the environment changes as the other agents learn.

MARL: N agents share an environment; each agent picks an action, and the environment steps on the joint action.
Non-stationarity: the other agents' policies shift during training.
CTDE: Centralized Training, Decentralized Execution.

What you will learn

Define joint action, Nash equilibrium, cooperative vs competitive settings.
Explain independent Q-learning and why it can diverge.
Describe CTDE (MADDPG, QMIX) at high level.
Map applications: self-play, team games, multi-robot coordination.
Recognize evaluation pitfalls — who are you scoring?

Setting types

Type	Goal	Example
Fully cooperative	Shared team reward	Co-op games, flocking
Fully competitive	Zero-sum	Chess, Go
General-sum	Mixed incentives	Traffic, markets
Self-play	Train vs copies of self	AlphaZero

Partial observability is common: each agent sees only its local sensors, and CTDE uses global information at training time only.
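
To make the setting types in the table concrete, here is a minimal sketch that brute-forces the pure-strategy Nash equilibria of a two-player matrix game: a joint action is an equilibrium if neither player can gain by changing only its own action. The helper name `pure_nash` and both payoff tables are illustrative assumptions, not part of any library.

```python
from itertools import product

def pure_nash(payoff_a, payoff_b):
    """Return all pure-strategy Nash equilibria of a two-player matrix game.

    payoff_a[i][j] and payoff_b[i][j] are the payoffs to players A and B
    when A plays row i and B plays column j.
    """
    n_rows, n_cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for i, j in product(range(n_rows), range(n_cols)):
        # A cannot gain by switching rows while B keeps column j.
        a_is_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(n_rows))
        # B cannot gain by switching columns while A keeps row i.
        b_is_best = all(payoff_b[i][j] >= payoff_b[i][k] for k in range(n_cols))
        if a_is_best and b_is_best:
            equilibria.append((i, j))
    return equilibria

# Fully cooperative: both players receive the same payoff (coordination game).
coop = [[1, 0], [0, 1]]
# Fully competitive: matching pennies, where B's payoff is the negative of A's.
pennies_a = [[1, -1], [-1, 1]]
pennies_b = [[-1, 1], [1, -1]]

coop_equilibria = pure_nash(coop, coop)
pennies_equilibria = pure_nash(pennies_a, pennies_b)
```

The coordination game has two pure equilibria, (0, 0) and (1, 1), so the agents still have to agree on one of them. Matching pennies has none in pure strategies; its only equilibrium is mixed.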

Independent learners (IQL)

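Each agent runs its own single-agent learner (Q-learning in IQL, or PPO in the policy-gradient analogue) and treats the other agents as part of the environment. This is simple, but the environment each agent sees is non-stationary: its transition and reward distributions shift as the others learn, so the stationarity assumption behind the Bellman backup no longer holds.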

python

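# Independent DQN: agent i stores and learns from its own transitions only.
# Assumes obs, actions, rewards, next_obs are per-agent lists from one env step.
for i, agent in enumerate(agents):
    agent.replay.add(obs[i], actions[i], rewards[i], next_obs[i])
    # Replayed samples were collected while the other agents followed older
    # policies, so the learning targets are stale (non-stationarity).
    agent.update()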

This can work in practice with large batches and slowly changing opponents, but there is no convergence guarantee.
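
As a runnable illustration of independent learning, the sketch below trains two tabular Q-learners on a stateless, shared-reward coordination game. It is a toy under stated assumptions: the function name `train_independent_q`, the payoff matrix, and the hyperparameters are all made up for this example, and a real IQL agent would also condition on observations.

```python
import random

def train_independent_q(payoff, episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Two independent Q-learners on a stateless, shared-reward matrix game."""
    rng = random.Random(seed)
    n_actions = len(payoff)
    # One Q-table per agent, indexed only by that agent's own action.
    q = [[0.0] * n_actions for _ in range(2)]
    for _ in range(episodes):
        actions = []
        for i in range(2):
            if rng.random() < epsilon:
                actions.append(rng.randrange(n_actions))  # explore
            else:
                actions.append(q[i].index(max(q[i])))     # exploit
        reward = payoff[actions[0]][actions[1]]
        for i in range(2):
            a = actions[i]
            # Each agent updates as if the reward depended on its own action alone.
            q[i][a] += alpha * (reward - q[i][a])
    return q

# Shared reward of 1 when both agents pick the same action, else 0.
coordination = [[1.0, 0.0], [0.0, 1.0]]
q_tables = train_independent_q(coordination)
```

The value an agent assigns to an action depends on what its partner is currently doing, so the estimate drifts whenever the partner's policy changes. Whether the two learners settle on a matching pair of actions depends on exploration and the seed.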

Checkpoint: In cooperative soccer, should each agent optimize its own reward or team reward?

Answer

Team reward, because it aligns incentives. The cost is harder credit assignment (who caused the goal?). Difference rewards or value decomposition (QMIX) attribute the team value to individual agents.
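
A difference reward can be sketched as the team reward minus the team reward obtained when one agent's action is replaced by a default. The example below is a minimal illustration; `team_reward`, the action names, and the choice of "idle" as the default are hypothetical.

```python
def team_reward(actions):
    """Hypothetical team reward: 1 if at least one agent shoots, else 0."""
    return 1.0 if "shoot" in actions else 0.0

def difference_rewards(actions, default_action="idle"):
    """D_i = G(a) - G(a with agent i's action replaced by a default)."""
    g = team_reward(actions)
    rewards = []
    for i in range(len(actions)):
        counterfactual = list(actions)
        counterfactual[i] = default_action
        rewards.append(g - team_reward(counterfactual))
    return rewards

d = difference_rewards(["shoot", "idle", "idle"])
```

Only the agent whose action changed the outcome receives credit here. The result depends on the default action chosen for the counterfactual, which is a design decision rather than something the team reward dictates.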

CTDE — centralized training, decentralized execution

Training: critic or mixer sees global state and all actions.
Execution: each policy uses only local observation.

Algorithm	Idea
MADDPG	Centralized critics, decentralized actors
QMIX	Monotonic mix of per-agent Q into Q_tot
MAPPO	Multi-agent PPO with shared or centralized critic

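QMIX enforces ∂Q_tot/∂Qᵢ ≥ 0, so the joint action that maximizes Q_tot is the one in which each agent greedily maximizes its own Qᵢ. Decentralized greedy execution therefore stays consistent with the centralized value.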
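
The monotonicity constraint can be checked numerically with a small sketch. The mixer below forces its weights to be non-negative with `abs()` and uses fixed random parameters; in QMIX itself the mixing weights come from hypernetworks conditioned on the global state, which this toy omits. The function name `make_monotonic_mixer` and the Q-values are assumptions for illustration.

```python
import math
import random
from itertools import product

def elu(x):
    """Monotonically increasing activation."""
    return x if x > 0 else math.exp(x) - 1.0

def make_monotonic_mixer(n_agents, hidden=4, seed=0):
    """Build a two-layer mixer whose weights are forced non-negative with abs()."""
    rng = random.Random(seed)
    w1 = [[abs(rng.gauss(0, 1)) for _ in range(hidden)] for _ in range(n_agents)]
    b1 = [rng.gauss(0, 1) for _ in range(hidden)]  # biases may have any sign
    w2 = [abs(rng.gauss(0, 1)) for _ in range(hidden)]
    b2 = rng.gauss(0, 1)

    def mix(agent_qs):
        h = [elu(sum(agent_qs[i] * w1[i][k] for i in range(n_agents)) + b1[k])
             for k in range(hidden)]
        return sum(h[k] * w2[k] for k in range(hidden)) + b2

    return mix

# Per-agent Q-values for 3 agents with 3 actions each (made-up numbers).
agent_q = [[0.2, 1.5, -0.3], [0.9, 0.1, 0.4], [-1.0, 0.0, 2.0]]
mix = make_monotonic_mixer(n_agents=3)

# Decentralized choice: each agent maximizes its own Q.
local_greedy = tuple(q.index(max(q)) for q in agent_q)
# Centralized choice: brute-force search over all joint actions.
joint_greedy = max(
    product(range(3), repeat=3),
    key=lambda joint: mix([agent_q[i][a] for i, a in enumerate(joint)]),
)
```

Because every weight is non-negative and the activation is increasing, raising any agent's Q can never lower the mixed value, so `local_greedy` and `joint_greedy` pick the same joint action without the agents ever searching the joint space.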

Self-play curriculum

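Train policy π against current and past versions of itself. Sampling opponents from a pool of past snapshots, or from a league of specialized agents, reduces strategy cycling (rock-paper-scissors dynamics without real progress). AlphaStar combined league training with a prior learned from human replays; OpenAI Five trained mostly against its current self, with a fraction of games against past versions, and used no human data.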

Benefit	Risk
Automatic curriculum	Exploit sim-only bugs
No human data	Mode collapse without diversity
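
Sampling opponents from past versions can be sketched with a small snapshot pool. This is a minimal illustration, not any system's actual league: the class name `OpponentPool`, the `latest_prob` value, and the dictionary standing in for policy parameters are all assumptions.

```python
import copy
import random

class OpponentPool:
    """Keeps frozen snapshots of past policies and samples opponents from them."""

    def __init__(self, latest_prob=0.8, seed=0):
        self.snapshots = []
        self.latest_prob = latest_prob  # chance of facing the newest snapshot
        self.rng = random.Random(seed)

    def add(self, policy):
        # Deep-copy so later training updates do not change stored opponents.
        self.snapshots.append(copy.deepcopy(policy))

    def sample(self):
        if not self.snapshots:
            raise ValueError("pool is empty; add a snapshot first")
        if len(self.snapshots) == 1 or self.rng.random() < self.latest_prob:
            return self.snapshots[-1]
        return self.rng.choice(self.snapshots[:-1])

pool = OpponentPool()
policy = {"version": 0}  # stand-in for real policy parameters
for step in range(5):
    policy["version"] = step  # pretend a training update happened
    pool.add(policy)
opponent = pool.sample()
```

Mixing in older snapshots keeps the learner from forgetting how to beat strategies it has already moved past, which is one way to add the diversity the table lists as a safeguard.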

Worked example: two-agent grid

Two agents must reach their switches simultaneously to open a door. The joint action space has |A|² entries for two agents and |A|^N for N agents. A factored Q with QMIX tends to learn coordination faster than a flat Q over the product space once N is moderate.
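
The size gap between a flat and a factored Q can be worked out directly. The sketch below counts action-value entries per state, assuming 5 actions per agent (for example four moves plus stay); the function name `q_table_sizes` is hypothetical.

```python
def q_table_sizes(n_agents, n_actions):
    """Number of action-value entries per state for flat vs factored Q."""
    flat = n_actions ** n_agents      # one entry per joint action
    factored = n_agents * n_actions   # one entry per agent per action
    return flat, factored

sizes = {n: q_table_sizes(n, n_actions=5) for n in (2, 4, 8)}
```

With two agents the flat table has 25 entries per state against 10 for the factored one; with eight agents it is 390625 against 40. The factored count ignores the mixer's own parameters, so it understates the full cost of QMIX.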

Evaluation and metrics

Team return vs individual return: specify which one you report.
Seeds and pairings: evaluate against held-out partners, not only co-trained ones.
Ad-hoc teamwork: train with diverse partners for zero-shot coordination (still an open research area).
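
Evaluating against held-out partners can be sketched as a cross-play matrix. The example is a toy under stated assumptions: `toy_evaluate` stands in for real environment rollouts, and the "convention" field is a made-up proxy for whatever coordination habit a pair learned together.

```python
from statistics import mean

def cross_play(policies_a, policies_b, evaluate):
    """Return matrix[i][j] = score of policies_a[i] paired with policies_b[j]."""
    return [[evaluate(a, b) for b in policies_b] for a in policies_a]

def toy_evaluate(a, b):
    """Hypothetical stand-in for rollouts: partners score 1 if conventions match."""
    return 1.0 if a["convention"] == b["convention"] else 0.0

trained = [{"name": "agent_0", "convention": "left"}]
co_trained_partners = [{"name": "partner_0", "convention": "left"}]
held_out_partners = [
    {"name": "new_0", "convention": "left"},
    {"name": "new_1", "convention": "right"},
]

co_trained_matrix = cross_play(trained, co_trained_partners, toy_evaluate)
held_out_matrix = cross_play(trained, held_out_partners, toy_evaluate)
co_trained_score = mean(mean(row) for row in co_trained_matrix)
held_out_score = mean(mean(row) for row in held_out_matrix)
```

In this toy the agent scores perfectly with its training partner and only half as well with unseen partners. Reporting both numbers makes overfit coordination visible.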

Common mistakes

Mistake	Symptom	Fix
IQL in competitive game	Oscillation, no convergence	Self-play / CTDE
Shared reward, no credit	Lazy agent	Difference reward / QMIX
Eval only with training partners	Overfit coordination	Partner population
Centralized execution at deployment	Sensor mismatch	Decentralized actors only
Ignoring communication cost	Sim-only telepathy	Limit bandwidth during training

Closing

MARL adds interaction to the RL problem, and non-stationarity and credit assignment dominate algorithm choice. CTDE and self-play are the workhorses; safety and monitoring (covered in the next lessons) matter when multiple agents act in production with human stakeholders.
