Module 7 — Model-based RL

World models & Dreamer (intro)

Learned latent dynamics, imagination rollouts, and Dreamer-style agents.

~60 min read + exercises

Before we begin

World models learn a compressed representation of the environment and predict future representations instead of raw pixels. Dreamer trains a latent dynamics model, imagines rollouts entirely in latent space, and learns a policy from those dreams — achieving strong sample efficiency on visual control tasks. This lesson is an intro to the architecture and training loop, not a full reproduction.

World model — neural model of environment dynamics, often in latent space z.
RSSM — Recurrent State Space Model; core of Dreamer v1–v3.
Imagination — policy rollouts inside the learned model without env interaction.


What you will learn

  • Motivate latent dynamics when states are high-dimensional images.
  • Decompose Dreamer into encoder, dynamics, decoder, reward head, critic.
  • Explain the imagination training loop for actor–critic in latent space.
  • Compare world models to one-step pixel predictors and MPC.
  • List practical limits: stochasticity, long horizons, sim-to-real gap.

Why not predict pixels directly?

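Predicting the next frame in RGB is hard: the model spends capacity on shadows, textures, and other detail that is irrelevant to control. Representation learning instead maps the observation oₜ to a latent zₜ, and the dynamics predict zₜ₊₁ from (zₜ, aₜ). The decoder reconstructs ôₜ only to provide a training signal; it is not needed when imagining rollouts.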

| Approach | Predicts | Pros | Cons |
| --- | --- | --- | --- |
| Pixel model | oₜ₊₁ pixels | Interpretable | Blurry, high-dimensional |
| Latent model (RSSM) | zₜ₊₁ | Efficient planning | Encoder errors |
| No decoder (MuZero) | z only | Very compact | Less interpretable |

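Dreamer uses an RSSM, which combines a deterministic path hₜ (a GRU) with a stochastic latent zₜ sampled from a learned distribution. The deterministic path carries memory across steps, which helps under partial observability, and the stochastic part represents uncertainty about the next state.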
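To make the two paths concrete, here is a minimal NumPy sketch of a single imagined RSSM step. It is an illustration under assumptions, not Dreamer's implementation: a plain tanh cell stands in for the GRU, the latent is a diagonal Gaussian, the weights are random rather than trained, and the sizes and the `rssm_prior_step` name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
H_DIM, Z_DIM, A_DIM = 8, 4, 2

# Randomly initialised weights stand in for trained parameters.
W_h = rng.normal(scale=0.1, size=(H_DIM, H_DIM + Z_DIM + A_DIM))
W_mean = rng.normal(scale=0.1, size=(Z_DIM, H_DIM))
W_std = rng.normal(scale=0.1, size=(Z_DIM, H_DIM))


def rssm_prior_step(h, z, a, rng):
    """One imagined step: deterministic update of h, then sample z from the prior."""
    # Deterministic path (a tanh cell here; Dreamer uses a GRU).
    h_next = np.tanh(W_h @ np.concatenate([h, z, a]))
    # Stochastic path: the prior over the next z depends only on h_next.
    mean = W_mean @ h_next
    std = np.logaddexp(0.0, W_std @ h_next) + 0.1  # softplus keeps std positive
    z_next = mean + std * rng.standard_normal(Z_DIM)
    return h_next, z_next, (mean, std)


h, z = np.zeros(H_DIM), np.zeros(Z_DIM)
a = np.array([1.0, -1.0])
h, z, (mean, std) = rssm_prior_step(h, z, a, rng)
print(h.shape, z.shape)
```

During world-model training, a posterior that also sees the encoded observation would replace the prior sample; this sketch covers only the observation-free step used in imagination.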


Dreamer components (high level)

  1. Encoder — CNN: oₜ → embed.
  2. RSSM — recurrent state (hₜ, zₜ); transition p(zₜ₊₁ | hₜ, aₜ).
  3. Decoder — (hₜ, zₜ) → reconstructed ôₜ.
  4. Reward / continue heads — predict rₜ and whether episode continues.
  5. Actor–critic — trained on imagined trajectories of length H in latent space.

Training alternates:

  • World model learning on real buffer: reconstruction + reward + KL regularization.
  • Behavior learning — roll out actor in imagined latents; maximize λ-return with critic.
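```python
# Conceptual imagination loop (not the full Dreamer API).
# actor, rssm, reward_head, critic, and start_state are placeholders.
latent_state = start_state  # posterior latent taken from a real replay sequence
rewards, values = [], []
for _ in range(imagination_horizon):
    action = actor(latent_state)
    latent_state = rssm.imagine_step(latent_state, action)  # prior only, no observation
    rewards.append(reward_head(latent_state))
    values.append(critic(latent_state))
# The actor and critic losses are built from the lambda-return of rewards and values.
```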
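The lambda-return mentioned at the end of the loop can be computed with a short backward recursion. The sketch below is one common formulation, assuming the critic supplies one extra value to bootstrap the tail and the continue head supplies a per-step continuation probability; the function name, argument layout, and numbers are hypothetical, and exact indexing differs between Dreamer versions.

```python
import numpy as np


def lambda_returns(rewards, values, continues, gamma=0.99, lam=0.95):
    """Compute lambda-returns backwards over one imagined trajectory.

    rewards[t]   : predicted reward for step t, shape (H,)
    values[t]    : critic value of the state at step t, shape (H + 1,)
                   (the last entry bootstraps the tail)
    continues[t] : predicted probability that the episode continues, shape (H,)
    """
    horizon = len(rewards)
    returns = np.zeros(horizon)
    next_return = values[-1]  # bootstrap from the critic at the horizon
    for t in reversed(range(horizon)):
        discount = gamma * continues[t]
        # Mix the one-step bootstrap with the longer lambda-return.
        target = (1.0 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + discount * target
        next_return = returns[t]
    return returns


# Hypothetical numbers for a 3-step imagined rollout.
rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.0, 0.0, 0.0, 0.0])
continues = np.array([1.0, 1.0, 1.0])
print(lambda_returns(rewards, values, continues, gamma=1.0, lam=1.0))  # [3. 2. 1.]
```

With `lam=1.0` the result is the plain discounted sum of imagined rewards plus the final bootstrap; with `lam=0.0` each target collapses to a one-step estimate that leans entirely on the critic.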

Worked example: sample efficiency intuition

On DeepMind Control cheetah run from pixels:

| Agent | Env steps to threshold return (approx.) | Notes |
| --- | --- | --- |
| PPO (pixels) | ~10M+ | Model-free baseline |
| DreamerV3 | ~1M, often competitive | Imagined actor–critic |
| SAC (state) | Lower if proprioception | Not a fair comparison vs pixels |

World models trade more compute per step (training the model and imagining rollouts) for fewer environment steps, which pays off when each real step is expensive or dangerous.

Checkpoint: If the decoder reconstructs well but the cheetah falls in reality, which component failed?

Answer

Likely dynamics or reward head in latent space — reconstruction can look good while z does not encode physics-critical features (velocity, contact). Or the actor overfits imagined rollouts that diverge from real latents. Check imagined vs real latent distributions and reward prediction error.
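One way to run that check is to compare open-loop imagined latents with the posterior latents inferred from real observations for the same action sequence. This is a minimal sketch, assuming both have already been exported as arrays; the function name, array layout, and synthetic data are hypothetical.

```python
import numpy as np


def open_loop_report(posterior_z, imagined_z, real_rewards, predicted_rewards):
    """Per-step divergence between imagination and reality.

    posterior_z, imagined_z         : shape (batch, horizon, latent_dim)
    real_rewards, predicted_rewards : shape (batch, horizon)
    Returns mean latent L2 distance and mean absolute reward error per step.
    """
    latent_gap = np.linalg.norm(posterior_z - imagined_z, axis=-1).mean(axis=0)
    reward_gap = np.abs(real_rewards - predicted_rewards).mean(axis=0)
    return latent_gap, reward_gap


# Synthetic stand-in data: imagined latents drift further from the
# posterior latents at every step.
rng = np.random.default_rng(0)
batch, horizon, latent_dim = 16, 5, 4
posterior_z = rng.normal(size=(batch, horizon, latent_dim))
drift = 0.1 * np.arange(1, horizon + 1)[None, :, None]
imagined_z = posterior_z + drift
real_rewards = rng.normal(size=(batch, horizon))
predicted_rewards = real_rewards + 0.05

latent_gap, reward_gap = open_loop_report(
    posterior_z, imagined_z, real_rewards, predicted_rewards
)
print(np.round(latent_gap, 2))
print(np.round(reward_gap, 2))
```

A latent gap that grows quickly with the step index points at the dynamics; a large reward gap alongside a small latent gap points at the reward head.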


Training objectives (simplified)

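| Loss | Purpose |
| --- | --- |
| Reconstruction | Minimize the error between ô and o so the latent retains observation content |
| Reward prediction | Align the latent with the task signal |
| KL(posterior ‖ prior) | Regularize stochastic z; prevent collapse |
| Continue / discount | Model episode termination |
| Actor–critic on dreams | Task performance without env interaction |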

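Weighting the KL term is delicate: too weak and the prior fails to track the posterior, so imagined rollouts drift away from the latents seen in training; too strong and the posterior collapses toward the prior, so the latents ignore the observations.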
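The KL term and its weight can be written out for the simplest case. The sketch below assumes diagonal-Gaussian latents (later Dreamer versions use categorical latents, where the KL takes a different form); `gaussian_kl`, `world_model_loss`, `kl_scale`, and the numbers are hypothetical names and values for illustration.

```python
import numpy as np


def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = std_q ** 2, std_p ** 2
    per_dim = (
        np.log(std_p / std_q)
        + (var_q + (mean_q - mean_p) ** 2) / (2.0 * var_p)
        - 0.5
    )
    return per_dim.sum(axis=-1)


def world_model_loss(recon_loss, reward_loss, kl, kl_scale=1.0):
    """Weighted sum of the world-model terms; kl_scale is the KL weight."""
    return recon_loss + reward_loss + kl_scale * kl


# Posterior (uses the observation) vs prior (dynamics only), hypothetical values.
post_mean, post_std = np.array([1.0, 0.0]), np.array([1.0, 1.0])
prior_mean, prior_std = np.array([0.0, 0.0]), np.array([1.0, 1.0])

kl = gaussian_kl(post_mean, post_std, prior_mean, prior_std)
print(kl)  # 0.5
print(world_model_loss(recon_loss=2.0, reward_loss=0.1, kl=kl, kl_scale=0.1))
```

Sweeping `kl_scale` while watching both reconstruction error and open-loop prediction error is a simple way to see the trade-off in practice.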


Dreamer family timeline

| Version | Notable improvement |
| --- | --- |
| DreamerV1 | Latent imagination for control |
| DreamerV2 | Discrete latents and KL balancing; more stable training at scale |
| DreamerV3 | Single configuration across 150+ tasks |

Open implementations exist (e.g. dreamerv3 repos); when reading one, look first at the hyperparameters for the imagination horizon H and for how updates are split between the world model and the actor–critic.


Limits and open problems

  • Compounding error in imagination — same issue as tabular models, worse in pixels.
  • Stochastic environments — need accurate uncertainty; else actor exploits model noise.
  • Real robots — world model must capture contacts; sim pre-training common.
  • Compute — wall-clock may exceed simple PPO unless env is slow.
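The compounding-error point is easy to see in a toy setting. The sketch below is not a learned latent model: it assumes a hypothetical 2-D rotation as the true dynamics and a "learned" model whose rotation angle is off by 2%, then rolls both forward open loop.

```python
import numpy as np

# True dynamics: a 2-D rotation by a small angle (hypothetical toy system).
theta = 0.1
A_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
# Stand-in for a learned model: the same rotation with a 2% error in the angle.
theta_hat = theta * 1.02
A_model = np.array([[np.cos(theta_hat), -np.sin(theta_hat)],
                    [np.sin(theta_hat),  np.cos(theta_hat)]])

state_true = np.array([1.0, 0.0])
state_model = state_true.copy()
errors = []
for step in range(50):
    state_true = A_true @ state_true
    state_model = A_model @ state_model  # open loop: never corrected by data
    errors.append(np.linalg.norm(state_true - state_model))

print(f"error after 1 step:   {errors[0]:.4f}")
print(f"error after 50 steps: {errors[-1]:.4f}")
```

A one-step error that looks negligible grows steadily over the rollout, which is why the imagination horizon H is usually kept short relative to the episode length.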

World models also enable counterfactual reasoning ("what if I had braked earlier?") for safety analysis, a topic Module 9 touches on.


Common mistakes

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Training actor only on real data | Loses sample-efficiency benefit | Imagination rollouts |
| Huge H with weak dynamics | Actor learns fantasy | Shorter H; KL / ensemble |
| Skipping continue head | Imagined episodes never end | Predict discount / done |
| Evaluating on train env only | Overfit visuals | New seeds, backgrounds |
| Confusing Dreamer with model-free SAC | Wrong baseline | Match observation type |

Closing

Dreamer shows that model-based RL at scale means learning to dream in latent space, then optimizing behavior inside those dreams. You now have the full Module 7 arc: explicit models + Dyna, tree search, and latent world models. Pick the planner that matches your action space, state representation, and cost of real interaction.

