Module 9 — Production & advanced topics

Multi-agent RL basics

Independent learners, non-stationarity, and centralized training with decentralized execution.

~60 min read + exercises

Before we begin

Real systems involve multiple decision-makers: traffic, games, warehouse robots, ad auctions. Multi-agent RL (MARL) studies interacting policies in cooperative, competitive, or mixed settings. Non-stationarity breaks single-agent assumptions: from one agent's view, the environment changes as the other agents learn.

MARL: N agents share a joint state; each agent picks its own action, and the environment steps on the joint action.
Non-stationarity: the other agents' policies shift during training, so the dynamics any one agent experiences keep changing.
CTDE: Centralized Training, Decentralized Execution.
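To make the first two definitions concrete, here is a minimal sketch of a stateless two-agent game that steps on a joint action. The payoff table and the function names (`step`, `reward_seen_by_agent_0`) are illustrative assumptions, not part of any library.

```python
from typing import Dict, Tuple

# Hypothetical two-agent coordination game: both agents are rewarded only
# when they pick the same action. Values are (reward_agent_0, reward_agent_1).
PAYOFFS: Dict[Tuple[int, int], Tuple[float, float]] = {
    (0, 0): (1.0, 1.0),
    (0, 1): (0.0, 0.0),
    (1, 0): (0.0, 0.0),
    (1, 1): (1.0, 1.0),
}


def step(joint_action: Tuple[int, int]) -> Tuple[float, float]:
    """Advance the (stateless) game by one joint action; return per-agent rewards."""
    return PAYOFFS[joint_action]


def reward_seen_by_agent_0(a0: int, p1_plays_0: float) -> float:
    """Expected reward of agent 0's action when agent 1 plays action 0
    with probability p1_plays_0 (and action 1 otherwise)."""
    return p1_plays_0 * step((a0, 0))[0] + (1.0 - p1_plays_0) * step((a0, 1))[0]
```

Calling `reward_seen_by_agent_0(0, 0.9)` and then `reward_seen_by_agent_0(0, 0.1)` gives different values for the same action: agent 0 did nothing differently, yet its "environment" changed because agent 1's policy shifted.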


What you will learn

  • Define joint action, Nash equilibrium, cooperative vs competitive settings.
  • Explain independent Q-learning and why it can diverge.
  • Describe CTDE (MADDPG, QMIX) at high level.
  • Map applications: self-play, team games, multi-robot coordination.
  • Recognize evaluation pitfalls — who are you scoring?

Setting types

| Type | Goal | Example |
|---|---|---|
| Fully cooperative | Shared team reward | Co-op games, flocking |
| Fully competitive | Zero-sum | Chess, Go |
| General-sum | Mixed incentives | Traffic, markets |
| Self-play | Train vs copies of self | AlphaZero |

Partial observability is common: each agent sees only its local sensors, and CTDE uses global information at training time only.
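For two-player matrix games, the setting types above can be checked mechanically, and the same payoff table also lets us enumerate pure-strategy Nash equilibria (joint actions where neither agent gains by deviating alone). The sketch below is one possible way to do this; the `Payoff` layout and both function names are assumptions made for illustration.

```python
from itertools import product
from typing import List, Tuple

# payoff[a0][a1] = (reward to agent 0, reward to agent 1)
Payoff = List[List[Tuple[float, float]]]


def classify(payoff: Payoff) -> str:
    """Label a two-player matrix game by its reward structure."""
    cells = [cell for row in payoff for cell in row]
    if all(r0 == r1 for r0, r1 in cells):
        return "fully cooperative"
    if all(r0 + r1 == 0 for r0, r1 in cells):
        return "fully competitive"
    return "general-sum"


def pure_nash(payoff: Payoff) -> List[Tuple[int, int]]:
    """Return joint actions where each action is a best response to the other."""
    n0, n1 = len(payoff), len(payoff[0])
    equilibria = []
    for a0, a1 in product(range(n0), range(n1)):
        r0, r1 = payoff[a0][a1]
        agent0_cannot_improve = all(payoff[b][a1][0] <= r0 for b in range(n0))
        agent1_cannot_improve = all(payoff[a0][b][1] <= r1 for b in range(n1))
        if agent0_cannot_improve and agent1_cannot_improve:
            equilibria.append((a0, a1))
    return equilibria


coordination = [[(1, 1), (0, 0)], [(0, 0), (1, 1)]]
matching_pennies = [[(1, -1), (-1, 1)], [(-1, 1), (1, -1)]]
prisoners_dilemma = [[(-1, -1), (-3, 0)], [(0, -3), (-2, -2)]]
```

The coordination game has two pure equilibria, matching pennies has none (its equilibrium is mixed), and the prisoner's dilemma has a single one at mutual defection. A game whose payoffs are all zero satisfies both tests; this sketch labels it cooperative.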


Independent learners (IQL)

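Each agent runs its own single-agent algorithm (Q-learning, PPO) and treats the other agents as part of the environment. This is simple, but the environment each agent sees is non-stationary: its transition and reward distributions shift whenever another agent updates its policy, so the stationary-MDP assumption behind the Bellman update no longer holds.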

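```python
# Independent DQN: each agent i learns from its own transitions only.
# obs, actions, rewards, next_obs are per-agent lists for the current step.
for i, agent in enumerate(agents):
    agent.replay.add(obs[i], actions[i], rewards[i], next_obs[i])
    # Replayed data was generated while the other agents followed older
    # policies, so the targets are computed against a moving environment.
    agent.update()
```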

This can work in practice, for example with large batches and slowly changing opponents, but it comes with no convergence guarantee.
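A runnable toy version of the independent-learner loop above, assuming a stateless two-action coordination game with a shared reward. It uses tabular Q-values rather than DQN, and the function name and hyperparameters are illustrative choices, not recommended settings.

```python
import random
from typing import List


def train_independent_q(episodes: int = 2000, alpha: float = 0.1,
                        epsilon: float = 0.1, seed: int = 0) -> List[List[float]]:
    """Two independent epsilon-greedy Q-learners in a coordination game."""
    rng = random.Random(seed)
    # q[i][a] is agent i's value estimate for its OWN action a.
    q = [[0.0, 0.0], [0.0, 0.0]]

    def act(i: int) -> int:
        if rng.random() < epsilon:
            return rng.randrange(2)  # explore
        return max(range(2), key=lambda a: q[i][a])  # exploit

    for _ in range(episodes):
        actions = [act(0), act(1)]
        reward = 1.0 if actions[0] == actions[1] else 0.0  # shared team reward
        for i in range(2):
            # Each agent conditions only on its own action. The partner's
            # behavior is folded into the "environment", which is why the
            # target each agent chases moves as the partner learns.
            q[i][actions[i]] += alpha * (reward - q[i][actions[i]])
    return q


q_values = train_independent_q()
```

Each agent's table has no entry for the partner's action, so any change in the partner's policy shows up only as a drift in the observed rewards.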

Checkpoint: In cooperative soccer, should each agent optimize its own reward or team reward?

Answer

Team reward aligns incentives, but credit assignment gets harder (who caused the goal?). Difference rewards or value decomposition (e.g. QMIX) attribute the team value to individual agents.
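A difference reward gives agent i the team reward minus the team reward that would have resulted without agent i's contribution. The sketch below assumes a hypothetical team objective (number of distinct targets covered) and uses "agent absent" as the counterfactual; both choices are for illustration only.

```python
from typing import List, Optional, Sequence


def team_reward(targets: Sequence[Optional[int]]) -> float:
    """Hypothetical team objective: number of distinct targets covered.
    A value of None means the agent is absent."""
    return float(len({t for t in targets if t is not None}))


def difference_rewards(targets: Sequence[int]) -> List[float]:
    """D_i = G(joint action) - G(joint action with agent i removed)."""
    g = team_reward(targets)
    rewards = []
    for i in range(len(targets)):
        counterfactual: List[Optional[int]] = list(targets)
        counterfactual[i] = None  # remove agent i's contribution
        rewards.append(g - team_reward(counterfactual))
    return rewards


print(difference_rewards([0, 0, 1]))  # [0.0, 0.0, 1.0]
```

Agents 0 and 1 both cover target 0, so removing either one changes nothing and each gets zero credit. Agent 2 is the only one covering target 1 and receives the full marginal contribution.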


CTDE — centralized training, decentralized execution

Training: critic or mixer sees global state and all actions.
Execution: each policy uses only local observation.

| Algorithm | Idea |
|---|---|
| MADDPG | Centralized critics, decentralized actors |
| QMIX | Monotonic mix of per-agent Q into Q_tot |
| MAPPO | Multi-agent PPO with shared or centralized critic |

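QMIX enforces ∂Q_tot/∂Qᵢ ≥ 0 for every agent i, so the joint action that maximizes Q_tot is the one in which each agent greedily maximizes its own Qᵢ. Decentralized greedy execution is therefore consistent with the centralized value.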
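The monotonicity constraint can be illustrated with a deliberately simplified mixer. QMIX itself uses a small mixing network whose weights are produced from the global state and kept non-negative; the sketch below assumes a single linear layer with fixed raw weights and takes absolute values to keep them non-negative. All names here are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def mix(agent_qs: Sequence[float], raw_weights: Sequence[float], bias: float) -> float:
    """Single-layer monotonic mixer: abs() keeps every weight non-negative,
    so Q_tot never decreases when any per-agent Q increases."""
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias


def greedy_joint_action(per_agent_q: List[List[float]]) -> Tuple[int, ...]:
    """Decentralized execution: each agent takes the argmax of its own Q."""
    return tuple(max(range(len(q)), key=lambda a: q[a]) for q in per_agent_q)


def best_joint_action(per_agent_q: List[List[float]],
                      raw_weights: Sequence[float], bias: float) -> Tuple[int, ...]:
    """Centralized check: brute-force argmax of Q_tot over all joint actions."""
    joint_actions = product(*(range(len(q)) for q in per_agent_q))
    return max(
        joint_actions,
        key=lambda ja: mix([q[a] for q, a in zip(per_agent_q, ja)], raw_weights, bias),
    )


per_agent_q = [[0.2, 0.9], [0.5, 0.1, 0.3]]
raw_weights = [-1.5, 0.7]  # the negative raw weight becomes +1.5 after abs()
bias = 0.3
```

With these values the decentralized greedy choice and the brute-force maximizer of Q_tot select the same joint action, without the greedy agents ever enumerating the product space.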


Self-play curriculum

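Train policy π against a pool of its own past versions rather than only the latest one. Playing only the latest copy can cycle through rock-paper-scissors strategies without net progress; a diverse opponent pool mitigates this. League training, as in AlphaStar, extends the idea with a population of agents that includes dedicated exploiters, and AlphaStar was initialized from human replays. OpenAI Five trained mostly against its latest self, with a share of games against past versions.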

| Benefit | Risk |
|---|---|
| Automatic curriculum | Exploit sim-only bugs |
| No human data | Mode collapse without diversity |
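One simple way to keep opponent diversity is to sample each training opponent from a mix of the latest policy and a pool of past snapshots. The sketch below assumes policies are opaque objects (strings stand in for checkpoints); `sample_opponent` and `p_latest` are hypothetical names, and 0.8 is only an example value.

```python
import random
from typing import List, TypeVar

P = TypeVar("P")


def sample_opponent(current: P, past: List[P], p_latest: float,
                    rng: random.Random) -> P:
    """Play the latest policy with probability p_latest, otherwise a
    uniformly chosen past snapshot. Falls back to the latest policy
    while the pool is still empty."""
    if not past or rng.random() < p_latest:
        return current
    return rng.choice(past)


rng = random.Random(0)
pool = ["v1", "v2", "v3"]  # stand-ins for saved checkpoints
opponents = [sample_opponent("v4", pool, p_latest=0.8, rng=rng) for _ in range(10)]
```

Uniform sampling over the pool is the simplest choice; prioritizing snapshots that the current policy still loses to is a common refinement, left out here.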

Worked example: two-agent grid

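Two agents must reach their switches at the same time to open a door. The joint action space has |A|² entries for two agents and |A|^N in general, so a flat Q-function over the product space grows exponentially with N. A factored Q (e.g. QMIX) keeps one Q per agent and typically learns coordination faster for moderate N.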
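The size argument is easy to check with arithmetic. The sketch below counts Q-value outputs per state for a flat joint Q versus a factored one; the five-action grid (four moves plus stay) is an assumption for illustration.

```python
def joint_q_entries(num_actions: int, num_agents: int) -> int:
    """Outputs per state for a flat Q over the joint action space: |A|^N."""
    return num_actions ** num_agents


def factored_q_entries(num_actions: int, num_agents: int) -> int:
    """Outputs per state for one Q per agent: N * |A|."""
    return num_actions * num_agents


# Assumed grid actions: up, down, left, right, stay.
for n in (2, 3, 5):
    print(n, joint_q_entries(5, n), factored_q_entries(5, n))
# 2 25 10
# 3 125 15
# 5 3125 25
```

With two agents the gap is small, which is why the flat approach is still workable in this example; the factored form pays off as N grows.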


Evaluation and metrics

  • Team return vs. individual return: state which one you report.
  • Seeds and pairings: evaluate against held-out partners, not only the partners seen during training.
  • Ad-hoc teamwork: train with diverse partners to target zero-shot coordination (an active research area).
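The pairing point can be made measurable by reporting self-play and cross-play scores separately. The sketch below assumes a user-supplied `evaluate_pair` function that returns the team return for two named policies; the toy version here just checks whether two policies share a learned convention, and all names are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List


def cross_play(policies: List[str],
               evaluate_pair: Callable[[str, str], float]) -> Dict[str, float]:
    """Average team return when policies are paired with themselves
    (self-play) versus with independently trained partners (cross-play).
    Requires at least two uniquely named policies."""
    self_scores = [evaluate_pair(p, p) for p in policies]
    cross_scores = [evaluate_pair(p, q) for p in policies for q in policies if p != q]
    return {"self_play": mean(self_scores), "cross_play": mean(cross_scores)}


# Toy stand-in: each training seed settled on a convention, and a pair
# succeeds only if both partners use the same one.
CONVENTION = {"seed0": "left", "seed1": "right", "seed2": "left"}


def toy_evaluate_pair(p: str, q: str) -> float:
    return 1.0 if CONVENTION[p] == CONVENTION[q] else 0.0


scores = cross_play(list(CONVENTION), toy_evaluate_pair)
```

In this toy setup self-play is perfect while cross-play succeeds in only a third of the pairings, which is the overfit-coordination pattern that evaluating only with training partners would hide.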

Common mistakes

| Mistake | Symptom | Fix |
|---|---|---|
| IQL in competitive game | Oscillation, no convergence | Self-play / CTDE |
| Shared reward, no credit | Lazy agent | Difference reward / QMIX |
| Eval only with training partners | Overfit coordination | Partner population |
| Centralized exec at deploy | Sensor mismatch | Decentralized actors only |
| Ignoring communication cost | Sim-only telepathy | Limit bandwidth at train |

Closing

MARL adds interaction to the RL problem — non-stationarity and credit assignment dominate algorithm choice. CTDE and self-play are workhorses; safety and monitoring (next lessons) matter when multiple agents act in production with human stakeholders.

