Multi-agent RL basics
Before we begin
Real systems involve multiple decision-makers: traffic, games, warehouse robots, ad auctions. Multi-agent RL (MARL) studies interacting policies in cooperative, competitive, or mixed settings. Non-stationarity breaks single-agent assumptions: from one agent's view, the environment changes as the others learn.
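MARL: N agents share an environment with a joint state; each agent chooses its own action, and the environment steps on the joint action.
Non-stationarity: the other agents' policies shift during training, so the dynamics one agent experiences keep changing.
CTDE: Centralized Training, Decentralized Execution.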
What you will learn
- Define joint action, Nash equilibrium, cooperative vs competitive settings.
- Explain independent Q-learning and why it can diverge.
- Describe CTDE (MADDPG, QMIX) at high level.
- Map applications: self-play, team games, multi-robot coordination.
- Recognize evaluation pitfalls — who are you scoring?
Setting types
| Type | Goal | Example |
|---|---|---|
| Fully cooperative | Shared team reward | Co-op games, flocking |
| Fully competitive | Zero-sum | Chess, Go |
| General-sum | Mixed incentives | Traffic, markets |
| Self-play | Train vs copies of self | AlphaZero |
Partial observability is common: each agent sees only its local sensors. CTDE uses global information at training time only.
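The setting types above can be made concrete with two-player matrix games. The following is a minimal sketch, assuming payoffs are stored as nested lists indexed `[a1][a2]`; the function name `pure_nash` and the example matrices are illustrative choices, not part of the lesson. It checks every joint action for a pure-strategy Nash equilibrium, meaning a joint action from which neither agent gains by deviating alone.

```python
from itertools import product

def pure_nash(payoff_1, payoff_2):
    """Return all pure-strategy Nash equilibria (a1, a2) of a bimatrix game."""
    n1, n2 = len(payoff_1), len(payoff_1[0])
    equilibria = []
    for a1, a2 in product(range(n1), range(n2)):
        # Agent 1 cannot gain by deviating while agent 2 keeps a2 fixed.
        best_1 = all(payoff_1[a1][a2] >= payoff_1[b][a2] for b in range(n1))
        # Agent 2 cannot gain by deviating while agent 1 keeps a1 fixed.
        best_2 = all(payoff_2[a1][a2] >= payoff_2[a1][b] for b in range(n2))
        if best_1 and best_2:
            equilibria.append((a1, a2))
    return equilibria

# Fully cooperative: both agents receive the same payoff.
coop = [[1, 0], [0, 1]]
# Fully competitive (zero-sum): matching pennies, agent 2 gets the negative.
pennies_1 = [[1, -1], [-1, 1]]
pennies_2 = [[-x for x in row] for row in pennies_1]
# General-sum: prisoner's dilemma (action 0 = cooperate, 1 = defect).
pd_1 = [[-1, -3], [0, -2]]
pd_2 = [[-1, 0], [-3, -2]]

print(pure_nash(coop, coop))            # [(0, 0), (1, 1)]
print(pure_nash(pennies_1, pennies_2))  # []
print(pure_nash(pd_1, pd_2))            # [(1, 1)]
```

The cooperative game has two equilibria (a coordination problem), matching pennies has none in pure strategies, and the prisoner's dilemma settles on mutual defection even though mutual cooperation pays both agents more.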
Independent learners (IQL)
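Each agent runs Q-learning or PPO on its own, treating the other agents as part of the environment. This is simple, but the environment each agent sees is non-stationary: the Bellman update assumes fixed transition and reward dynamics, and that assumption fails while the other agents keep learning.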
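    # Independent DQN: agent i stores and learns from its own transitions only
    for i, agent in enumerate(agents):
        agent.replay.add(obs[i], actions[i], rewards[i], next_obs[i])
        agent.update()  # others' policies may have changed since these transitions were stored

This can work in practice with large batches and slowly changing opponents, but it has no convergence guarantee.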
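To make the non-stationarity concrete, here is a minimal sketch of stateless independent Q-learning on a 2x2 coordination game. The payoff matrix, hyperparameters, and function names (`run_iql`, `expected_reward`) are assumptions chosen for illustration; this is tabular Q-learning rather than DQN.

```python
import random

# Shared payoff for a 2x2 coordination game: reward 1 only if actions match.
PAYOFF = [[1.0, 0.0], [0.0, 1.0]]

def run_iql(steps=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Stateless independent Q-learning: each agent keeps its own Q-table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[i][a]: agent i's estimate for action a

    def act(i):
        # Epsilon-greedy over the agent's own Q-values only.
        if rng.random() < epsilon:
            return rng.randrange(2)
        return max(range(2), key=lambda a: q[i][a])

    for _ in range(steps):
        a0, a1 = act(0), act(1)
        r = PAYOFF[a0][a1]  # reward depends on the joint action
        # Each agent updates as if the reward depended on its own action alone.
        q[0][a0] += alpha * (r - q[0][a0])
        q[1][a1] += alpha * (r - q[1][a1])
    return q

def expected_reward(own_action, partner_policy):
    """True value of own_action given the partner's action probabilities."""
    return sum(p * PAYOFF[own_action][b] for b, p in enumerate(partner_policy))

# The same action has a different value once the partner's policy shifts,
# which is the non-stationarity a single agent's Q-table cannot see.
print(expected_reward(0, [0.9, 0.1]))  # 0.9
print(expected_reward(0, [0.1, 0.9]))  # 0.1
print(run_iql())
```

Each Q-table is chasing a target that moves whenever the partner updates, so the learned values are only meaningful relative to the partner's current behavior.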
Checkpoint: In cooperative soccer, should each agent optimize its own reward or team reward?
Answer
A team reward aligns incentives, but credit assignment gets harder (who caused the goal?). Difference rewards or value decomposition (QMIX) attribute the team's return or Q-value to individual agents.
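One way to see how difference rewards assign credit is a small counterfactual calculation. The sketch below assumes a hypothetical team objective (number of distinct targets covered) and uses `None` as a stand-in for "this agent did nothing"; both are illustrative choices.

```python
def team_reward(actions):
    """Hypothetical team objective: number of distinct targets covered."""
    return len({a for a in actions if a is not None})

def difference_rewards(actions):
    """D_i = G(joint action) - G(joint action with agent i removed)."""
    g = team_reward(actions)
    rewards = []
    for i in range(len(actions)):
        counterfactual = list(actions)
        counterfactual[i] = None  # None stands in for "agent i did nothing"
        rewards.append(g - team_reward(counterfactual))
    return rewards

# Agents 0 and 1 both cover target "A"; agent 2 alone covers target "B".
actions = ["A", "A", "B"]
print(team_reward(actions))         # 2
print(difference_rewards(actions))  # [0, 0, 1]
```

Under the shared team reward all three agents would receive 2; the difference reward shows that only agent 2's action is currently irreplaceable.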
CTDE — centralized training, decentralized execution
Training: critic or mixer sees global state and all actions.
Execution: each policy uses only local observation.
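The information split can be sketched with two plain functions. This is a toy illustration under stated assumptions: the linear `actor` and `centralized_critic`, the weight values, and the shared actor weights across agents are all hypothetical, and no learning is shown.

```python
def actor(local_obs, weights):
    """Decentralized policy: picks an action from the agent's own observation."""
    scores = [sum(w * x for w, x in zip(row, local_obs)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

def centralized_critic(all_obs, joint_action, weights):
    """Training-only value estimate over every agent's observation and action."""
    features = [x for obs in all_obs for x in obs]
    features += [float(a) for a in joint_action]
    return sum(w * x for w, x in zip(weights, features))

# Two agents, each observing a 2-dimensional local feature vector.
all_obs = [[1.0, 0.0], [0.0, 1.0]]
actor_weights = [[0.5, -0.5], [-0.5, 0.5]]  # one row of weights per action
critic_weights = [0.1, 0.2, 0.3, 0.4, 1.0, -1.0]  # 4 obs features + 2 actions

# Execution: each agent calls the actor on its own observation only.
joint_action = [actor(obs, actor_weights) for obs in all_obs]
print(joint_action)  # [0, 1]

# Training: the critic scores the joint action using all observations.
print(centralized_critic(all_obs, joint_action, critic_weights))
```

At deployment only `actor` is needed, so nothing at execution time depends on observations or actions the agent cannot access locally.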
| Algorithm | Idea |
|---|---|
| MADDPG | Centralized critics, decentralized actors |
| QMIX | Monotonic mix of per-agent Q into Q_tot |
| MAPPO | Multi-agent PPO with shared or centralized critic |
QMIX enforces ∂Q_tot/∂Qᵢ ≥ 0, so the joint action formed by each agent greedily maximizing its own Qᵢ also maximizes Q_tot.
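The monotonicity property can be checked by brute force on a tiny example. The sketch below is a deliberate simplification: it uses a fixed linear mixer with absolute-valued weights, whereas QMIX itself produces state-dependent non-negative weights with a hypernetwork and mixes nonlinearly. The Q-values and weights are made-up numbers.

```python
from itertools import product

def mix(per_agent_q, weights, bias):
    """Monotonic mixer: Q_tot = sum_i |w_i| * Q_i + bias, so dQ_tot/dQ_i >= 0."""
    return sum(abs(w) * q for w, q in zip(weights, per_agent_q)) + bias

# Per-agent Q-values for 3 actions each (illustrative numbers).
q_tables = [[0.2, 1.5, -0.3], [0.9, 0.1, 0.4]]
weights, bias = [-2.0, 0.5], 0.1  # abs() makes the raw negative weight harmless

# Decentralized choice: each agent takes the argmax of its own Q.
greedy = tuple(max(range(3), key=q.__getitem__) for q in q_tables)

def q_tot(joint):
    """Team value of a joint action under the mixer."""
    return mix([q[a] for q, a in zip(q_tables, joint)], weights, bias)

# Centralized check: brute-force argmax of Q_tot over the joint action space.
best_joint = max(product(range(3), repeat=2), key=q_tot)
print(greedy, best_joint)  # (1, 0) (1, 0)
```

Because the per-agent argmax matches the brute-force joint argmax, agents can act greedily on local Q-values at execution time without searching the joint action space.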
Self-play curriculum
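Train policy π against current and past versions of itself; league training extends this to a population of policies. Playing against a diverse set of opponents helps prevent cycling, where strategies chase each other rock-paper-scissors style without real progress. AlphaStar used league training initialized from human replay data; OpenAI Five trained by self-play against current and past versions of itself.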
| Benefit | Risk |
|---|---|
| Automatic curriculum | Exploit sim-only bugs |
| No human data | Mode collapse without diversity |
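A common ingredient of self-play is a pool of past snapshots to sample opponents from. The sketch below is one possible design, not a description of any specific system: the class name `OpponentPool`, the `latest_prob` default of 0.8, and the use of strings as stand-ins for saved parameters are all assumptions.

```python
import random

class OpponentPool:
    """Keeps past policy snapshots and samples opponents for self-play."""

    def __init__(self, latest_prob=0.8, seed=0):
        self.snapshots = []
        self.latest_prob = latest_prob
        self.rng = random.Random(seed)

    def add(self, snapshot):
        self.snapshots.append(snapshot)

    def sample(self):
        if not self.snapshots:
            raise ValueError("pool is empty; add a snapshot first")
        # Mostly play the newest policy, sometimes an older one for diversity.
        if len(self.snapshots) == 1 or self.rng.random() < self.latest_prob:
            return self.snapshots[-1]
        return self.rng.choice(self.snapshots[:-1])

pool = OpponentPool()
for version in range(5):
    pool.add(f"policy_v{version}")  # strings stand in for saved parameters

opponents = [pool.sample() for _ in range(1000)]
# Fraction of games played against the newest snapshot.
print(opponents.count("policy_v4") / 1000)
```

Keeping older snapshots in the mix is a simple guard against forgetting how to beat earlier strategies, which is one way cycling shows up.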
Worked example: two-agent grid
Agents must reach their switches simultaneously to open a door. With two agents the joint action space has |A|² entries, and it grows as |A|^N for N agents. A factored Q with QMIX typically learns coordination faster than a flat Q over the product space once N is moderate.
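The size argument is easy to check numerically. This sketch counts Q-value entries per state for a flat joint table versus per-agent tables, assuming every agent has the same number of actions; the function names and the choice of 5 actions are illustrative.

```python
def joint_q_entries(num_actions, num_agents):
    """Entries per state in a flat Q-table over the product action space."""
    return num_actions ** num_agents

def factored_q_entries(num_actions, num_agents):
    """Entries per state when each agent keeps its own Q over local actions."""
    return num_actions * num_agents

for n in (2, 4, 8):
    print(n, joint_q_entries(5, n), factored_q_entries(5, n))
# 2 25 10
# 4 625 20
# 8 390625 40
```

The flat table grows exponentially in the number of agents while the factored representation grows linearly, which is why factoring matters more as N increases.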
Evaluation and metrics
- Team return vs individual return — specify which.
- Seeds and pairings — eval against held-out partners, not only co-trained.
- Ad-hoc teamwork — train with diverse partners for zero-shot coordination (research).
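The gap between co-trained and held-out partners can be summarized from a cross-play matrix. The sketch below assumes you have already measured mean team return for every pairing of independently trained runs; the function name `cross_play_summary` and the score values are hypothetical and made up for illustration.

```python
def cross_play_summary(scores):
    """scores[i][j]: mean team return when an agent from run i pairs with run j."""
    n = len(scores)
    self_play = [scores[i][i] for i in range(n)]
    cross_play = [scores[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        "self_play_mean": sum(self_play) / len(self_play),
        "cross_play_mean": sum(cross_play) / len(cross_play),
    }

# Hypothetical returns from 3 training seeds (not real results).
scores = [
    [10.0, 2.0, 3.0],
    [2.0, 9.0, 1.0],
    [3.0, 1.0, 11.0],
]
print(cross_play_summary(scores))
# {'self_play_mean': 10.0, 'cross_play_mean': 2.0}
```

A large gap between the two means suggests the agents learned conventions specific to their training partners rather than coordination that transfers.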
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| IQL in competitive game | Oscillation, no convergence | Self-play / CTDE |
| Shared reward, no credit | Lazy agent | Difference reward / QMIX |
| Eval only with training partners | Overfit coordination | Partner population |
| Centralized exec at deploy | Sensor mismatch | Decentralized actors only |
| Ignoring communication cost | Sim-only telepathy | Limit bandwidth at train |
Closing
MARL adds interaction to the RL problem — non-stationarity and credit assignment dominate algorithm choice. CTDE and self-play are workhorses; safety and monitoring (next lessons) matter when multiple agents act in production with human stakeholders.