World models & Dreamer (intro)
Before we begin
World models learn a compressed representation of the environment and predict future representations instead of raw pixels. Dreamer trains a latent dynamics model, imagines rollouts entirely in latent space, and learns a policy from those dreams — achieving strong sample efficiency on visual control tasks. This lesson is an intro to the architecture and training loop, not a full reproduction.
World model — neural model of environment dynamics, often in latent space z.
RSSM — Recurrent State Space Model; core of Dreamer v1–v3.
Imagination — policy rollouts inside the learned model without env interaction.
What you will learn
- Motivate latent dynamics when states are high-dimensional images.
- Decompose Dreamer into encoder, dynamics (RSSM), decoder, reward and continue heads, and actor–critic.
- Explain the imagination training loop for actor–critic in latent space.
- Compare world models to one-step pixel predictors and MPC.
- List practical limits: stochasticity, long horizons, sim-to-real gap.
Why not predict pixels directly?
Predicting next frames in RGB is hard — shadows, textures, irrelevant detail. Representation learning maps observation oₜ to latent zₜ; dynamics predict zₜ₊₁ from (zₜ, aₜ). The decoder reconstructs ôₜ only for training signal.
| Approach | Predicts | Pros | Cons |
|---|---|---|---|
| Pixel model | oₜ₊₁ pixels | Interpretable | Blurry, high-dimensional |
| Latent model (RSSM) | zₜ₊₁ | Efficient planning | Encoder errors |
| No decoder (MuZero) | Reward, value, and policy from z | Very compact | Less interpretable |
Dreamer uses RSSM: deterministic path hₜ (GRU) + stochastic zₜ sampled from a distribution — captures partial observability and uncertainty.
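The two paths can be made concrete with a toy, scalar version of one imagined step. This is only a sketch under loud assumptions: `rssm_prior_step` is a hypothetical name, a `tanh` update stands in for the GRU, and every constant is an arbitrary illustrative value rather than a learned weight. Real implementations use vector states and neural networks throughout.

```python
import math
import random

def rssm_prior_step(h, z, a, rng):
    """One imagined step: update the deterministic h, then sample the stochastic z."""
    # Deterministic path: stand-in for the GRU update h' = f(h, z, a).
    # The weights 0.9, 0.5, 0.3 are arbitrary constants chosen for illustration.
    h_next = math.tanh(0.9 * h + 0.5 * z + 0.3 * a)
    # Stochastic path: a Gaussian over z' whose parameters depend on h'.
    mean = 0.8 * h_next
    std = 0.1 + 0.05 * abs(h_next)  # kept strictly positive
    z_next = rng.gauss(mean, std)
    return h_next, z_next

rng = random.Random(0)  # fixed seed so the rollout is repeatable
h, z = 0.0, 0.0
trajectory = []
for a in [1.0, 1.0, -1.0]:  # a short, made-up action sequence
    h, z = rssm_prior_step(h, z, a, rng)
    trajectory.append((h, z))
```

The point of the split is that `h` carries memory forward deterministically, while sampling `z` lets the model represent more than one plausible next state.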
Dreamer components (high level)
- Encoder: CNN mapping oₜ → embed.
- RSSM: recurrent state (hₜ, zₜ); transition prior p(zₜ₊₁ | hₜ, zₜ, aₜ), plus a posterior that also conditions on the embedding of the current observation.
- Decoder: (hₜ, zₜ) → reconstructed ôₜ.
- Reward / continue heads: predict rₜ and whether the episode continues.
- Actor–critic: trained on imagined trajectories of length H in latent space.
Training alternates:
- World model learning on real experience from the replay buffer: reconstruction + reward + continue prediction + KL regularization.
- Behavior learning: roll out the actor in imagined latents; maximize the λ-return estimated with the critic.
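The λ-return named in the behavior-learning step can be sketched in a few lines. This is a minimal version of the standard recursive form, not code from any Dreamer repository. The function name, the default `gamma` and `lam` values, and the indexing convention (H rewards, H+1 values, with the last value used for bootstrapping) are assumptions; implementations differ on these details.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns for one imagined trajectory.

    rewards: H predicted rewards.
    values:  H + 1 critic estimates; the last one bootstraps the tail.
    """
    assert len(values) == len(rewards) + 1
    ret = values[-1]  # the return beyond the horizon is the critic's estimate
    out = []
    for t in reversed(range(len(rewards))):
        # Blend the one-step target (weight 1 - lam) with the longer return (weight lam).
        ret = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * ret)
        out.append(ret)
    out.reverse()
    return out

# Made-up numbers: three imagined steps with reward 1 and a critic that predicts 0.
targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

With `lam=0` each target reduces to the one-step bootstrap `r + gamma * v_next`; with `lam=1` it becomes the full discounted sum plus the final bootstrap value.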
    # Conceptual imagination loop (not the full Dreamer API)
    values, rewards = [], []
    for _ in range(imagination_horizon):
        action = actor(latent_state)                            # act from the current latent
        latent_state = rssm.imagine_step(latent_state, action)  # prior step, no observation
        rewards.append(reward_head(latent_state))               # predicted reward
        values.append(critic(latent_state))                     # value estimate for bootstrapping
    # actor loss is computed from the lambda-return over rewards and values
Worked example: sample efficiency intuition
On DeepMind Control Cheetah Run from pixels (rough orders of magnitude for intuition, not benchmark results):
| Agent | Env steps to threshold return | Notes |
|---|---|---|
| PPO (pixels) | ~10M+ | Model-free baseline |
| DreamerV3 | ~1M (often competitive) | Imagined actor–critic |
| SAC (state) | Fewer, given proprioceptive state | Not a fair comparison with pixel agents |
World models trade compute per step (train model + imagine) for fewer env steps — valuable when steps are money or danger.
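A back-of-the-envelope calculation shows where the trade pays off. The step counts reuse the rough figures from the table above. The per-step compute costs (1 unit for the model-free agent, 20 units for the world-model agent) are made-up values for illustration only, and both function names are hypothetical.

```python
def total_cost(env_steps, cost_per_env_step, compute_per_env_step):
    """Real-interaction cost plus training compute, in one arbitrary shared unit."""
    return env_steps * (cost_per_env_step + compute_per_env_step)

def model_based_is_cheaper(cost_per_env_step):
    # Step counts: rough table figures. Compute costs: invented for this sketch.
    model_free = total_cost(10_000_000, cost_per_env_step, compute_per_env_step=1.0)
    model_based = total_cost(1_000_000, cost_per_env_step, compute_per_env_step=20.0)
    return model_based < model_free

cheap_simulator = model_based_is_cheaper(0.1)  # env steps cost almost nothing
real_robot = model_based_is_cheaper(5.0)       # env steps dominate the bill
```

Under these assumed numbers the world-model agent only wins once a real step costs more than about 1.1 units, which matches the intuition that fast simulators favor simple model-free methods.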
Checkpoint: If the decoder reconstructs well but the cheetah falls in reality, which component failed?
Answer
Likely dynamics or reward head in latent space — reconstruction can look good while z does not encode physics-critical features (velocity, contact). Or the actor overfits imagined rollouts that diverge from real latents. Check imagined vs real latent distributions and reward prediction error.
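The two checks in the answer can be sketched as simple per-step error measures. This assumes you have logged, for the same start state and action sequence, the latents inferred from real observations and the latents produced by imagination. The function names and the sample numbers are hypothetical.

```python
def per_step_mse(real, imagined):
    """Mean squared error at each horizon step between two sequences of vectors."""
    assert len(real) == len(imagined)
    errors = []
    for r_vec, i_vec in zip(real, imagined):
        errors.append(sum((r - i) ** 2 for r, i in zip(r_vec, i_vec)) / len(r_vec))
    return errors

def mean_abs_error(real_rewards, predicted_rewards):
    """Average absolute gap between observed and predicted rewards."""
    pairs = list(zip(real_rewards, predicted_rewards))
    return sum(abs(r - p) for r, p in pairs) / len(pairs)

# Hypothetical logged data: latents from real observations vs. an imagined rollout.
real_latents = [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]]
imagined_latents = [[0.0, 1.0], [0.4, 1.0], [0.6, 1.0]]
latent_drift = per_step_mse(real_latents, imagined_latents)
reward_error = mean_abs_error([1.0, 0.0, 1.0], [0.9, 0.1, 0.5])
```

A drift curve that climbs with the horizon points at the dynamics; a large reward error with small drift points at the reward head.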
Training objectives (simplified)
| Loss | Purpose |
|---|---|
| Reconstruction | Make ô match o so the latent retains observation information |
| Reward prediction | Align with task signal |
| KL(posterior ‖ prior) | Regularize stochastic z; prevent collapse |
| Continue / discount | Model episode termination |
| Actor–critic on dreams | Task performance without env |
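Balancing the KL term is delicate: too weak and the prior never learns to track the posterior, so imagined rollouts drift from real latents; too strong and the posterior collapses toward the prior, so the latents ignore observations.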
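To make the KL term tangible, here is the closed-form KL divergence between two scalar Gaussians, together with a weighted sum of the loss terms. It is a simplified sketch: the weight `beta` and both function names are assumptions, and it leaves out refinements that later Dreamer versions add, such as categorical latents.

```python
import math

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ) for scalar Gaussians."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2)
            - 0.5)

def world_model_loss(recon, reward, kl, beta=1.0):
    # beta is a hypothetical knob for how strongly the KL term is weighted.
    return recon + reward + beta * kl

same = gaussian_kl(0.0, 1.0, 0.0, 1.0)     # posterior equals prior
shifted = gaussian_kl(1.0, 1.0, 0.0, 1.0)  # posterior mean moved by one std
loss = world_model_loss(recon=1.0, reward=0.5, kl=shifted, beta=2.0)
```

The KL is zero only when the posterior matches the prior, so a large `beta` pushes the posterior to throw away what the observation adds.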
Dreamer family timeline
| Version | Notable improvement |
|---|---|
| DreamerV1 | Latent imagination for control |
| DreamerV2 | Discrete (categorical) latents and KL balancing; more stable scaling |
| DreamerV3 | Single config across 150+ tasks |
Open implementations exist (e.g. dreamerv3 repos). When reading their configs, look for the imagination horizon H and for how training is split between world-model and actor updates.
Limits and open problems
- Compounding error in imagination — same issue as tabular models, worse in pixels.
- Stochastic environments — need accurate uncertainty; else actor exploits model noise.
- Real robots — world model must capture contacts; sim pre-training common.
- Compute — wall-clock may exceed simple PPO unless env is slow.
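The first limit, compounding error, is easy to see in a toy linear system. In this sketch the true dynamics keep the state constant, while the learned model is assumed to be off by 2% per step. Both coefficients and the function name are illustrative assumptions.

```python
def rollout(x0, coeff, steps):
    """Iterate x <- coeff * x and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(coeff * xs[-1])
    return xs

true_traj = rollout(1.0, coeff=1.00, steps=50)
model_traj = rollout(1.0, coeff=1.02, steps=50)  # assumed 2% one-step model error
gap = [abs(m - t) for m, t in zip(model_traj, true_traj)]
# gap[1] is the small one-step error; gap[50] is the same error compounded 50 times.
```

A one-step error that looks negligible grows geometrically with the horizon, which is why the imagination horizon H is kept short when the dynamics model is weak.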
World models also enable counterfactual reasoning ("what if I had braked earlier?") for safety analysis, a topic touched on in Module 9.
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Training actor only on real data | Loses sample-efficiency benefit | Imagination rollouts |
| Huge H with weak dynamics | Actor learns fantasy | Shorter H; stronger KL regularization or a model ensemble |
| Skipping continue head | Imagined episodes never end | Predict discount / done |
| Evaluating on train env only | Overfit visuals | New seeds, backgrounds |
| Comparing pixel-based Dreamer with state-based model-free SAC | Unfair baseline | Match observation type |
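The continue-head row can be illustrated with a small return calculation. In this sketch each imagined reward is weighted by the running product of the discount and the predicted continue probability, so rewards after a predicted termination stop counting. The function name and the convention that a reward is received before that step's continue flag applies are assumptions.

```python
def imagined_return(rewards, continue_probs, gamma=0.99):
    """Discounted return, also weighted by the chance the episode is still running."""
    total, weight = 0.0, 1.0
    for r, c in zip(rewards, continue_probs):
        total += weight * r  # reward counts only as far as the episode is still alive
        weight *= gamma * c  # shrink future weight by discount and continue probability
    return total

# Made-up example: the model predicts termination after the second step.
with_head = imagined_return([1.0, 1.0, 1.0], [1.0, 0.0, 1.0], gamma=1.0)
without_head = imagined_return([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=1.0)
```

Without the continue head the third reward is still counted, so the actor is credited for reward the real episode would never deliver.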
Closing
Dreamer shows that model-based RL at scale means learning to dream in latent space, then optimizing behavior inside those dreams. You now have the full Module 7 arc: explicit models + Dyna, tree search, and latent world models. Pick the planner that matches your action space, state representation, and cost of real interaction.