World models & Dreamer (intro)

Before we begin

World models learn a compressed representation of the environment and predict future representations instead of raw pixels. Dreamer trains a latent dynamics model, imagines rollouts entirely in latent space, and learns a policy from those dreams — achieving strong sample efficiency on visual control tasks. This lesson is an intro to the architecture and training loop, not a full reproduction.

World model — neural model of environment dynamics, often in latent space z.
RSSM — Recurrent State Space Model; core of Dreamer v1–v3.
Imagination — policy rollouts inside the learned model without env interaction.

What you will learn

Motivate latent dynamics when states are high-dimensional images.
Decompose Dreamer into encoder, dynamics (RSSM), decoder, reward and continue heads, and actor–critic.
Explain the imagination training loop for actor–critic in latent space.
Compare world models to one-step pixel predictors and MPC.
List practical limits: stochasticity, long horizons, sim-to-real gap.

Why not predict pixels directly?

Predicting the next RGB frame is hard: the model spends capacity on shadows, textures, and other task-irrelevant detail. Representation learning maps observation oₜ to a latent zₜ, and the dynamics model predicts zₜ₊₁ from (zₜ, aₜ). The decoder reconstructs ôₜ only to provide a training signal; it is not needed when imagining or acting.

Approach	Predicts	Pros	Cons
Pixel model	oₜ₊₁ pixels	Interpretable	Blurry, high-dimensional
Latent model (RSSM)	zₜ₊₁	Efficient planning	Encoder errors
No decoder (MuZero)	z only	Very compact	Less interpretable

Dreamer uses an RSSM: a deterministic path hₜ (a GRU) carries memory across steps, and a stochastic zₜ is sampled from a learned distribution. Together they handle partial observability and represent uncertainty about the next state.
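To make the two paths concrete, here is a minimal sketch of one imagination step. It is a toy under stated assumptions, not Dreamer's implementation: ToyRSSM, matvec, and softplus are hypothetical names, a plain tanh recurrence stands in for the GRU, the latent is Gaussian (DreamerV2 and V3 use categorical latents), and the weights are random rather than trained.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softplus(x):
    """Smooth positive function used to keep the standard deviation above zero."""
    return math.log1p(math.exp(x))


class ToyRSSM:
    """Hypothetical toy cell: tanh recurrence (GRU stand-in) plus a Gaussian latent."""

    def __init__(self, h_dim, z_dim, a_dim, seed=0):
        self.rng = random.Random(seed)
        in_dim = h_dim + z_dim + a_dim
        self.w_h = self._init(h_dim, in_dim)    # deterministic path weights
        self.w_mean = self._init(z_dim, h_dim)  # prior mean head
        self.w_std = self._init(z_dim, h_dim)   # prior standard-deviation head

    def _init(self, rows, cols):
        # Random, untrained weights; a real model would learn these.
        return [[self.rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]

    def imagine_step(self, h, z, action):
        """Advance one step without an observation (the prior path)."""
        # Deterministic path: h_next depends only on (h, z, action).
        h_next = [math.tanh(v) for v in matvec(self.w_h, h + z + action)]
        # Stochastic path: sample z_next from a Gaussian conditioned on h_next.
        mean = matvec(self.w_mean, h_next)
        std = [softplus(v) + 1e-3 for v in matvec(self.w_std, h_next)]
        z_next = [m + s * self.rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        return h_next, z_next


cell = ToyRSSM(h_dim=4, z_dim=2, a_dim=1)
h0, z0, a0 = [0.1] * 4, [0.0] * 2, [1.0]
h_a, z_a = cell.imagine_step(h0, z0, a0)
h_b, z_b = cell.imagine_step(h0, z0, a0)
print(h_a == h_b)  # deterministic path repeats for identical inputs
print(z_a != z_b)  # stochastic latent is resampled on every call
```

Calling the step twice with the same inputs shows the split: the recurrent state is reproducible, while the sampled latent varies, which is how the model can express more than one plausible future.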

Dreamer components (high level)

Encoder — CNN: oₜ → embed.
RSSM — recurrent state (hₜ, zₜ); deterministic update hₜ₊₁ = f(hₜ, zₜ, aₜ), then prior p(zₜ₊₁ | hₜ₊₁).
Decoder — (hₜ, zₜ) → reconstructed ôₜ.
Reward / continue heads — predict rₜ and whether episode continues.
Actor–critic — trained on imagined trajectories of length H in latent space.

Training alternates:

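World model learning: on sequences sampled from the replay buffer of real experience, minimize reconstruction, reward, and KL losses.
Behavior learning: roll out the actor in imagined latents, starting from replayed states; train the critic on λ-returns and the actor to maximize them.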

python

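# Conceptual imagination loop (not the full Dreamer API).
# actor, rssm, reward_head, and critic stand in for trained modules.
rewards, values = [], []
for _ in range(imagination_horizon):
    action = actor(latent_state)                            # act from the current latent
    latent_state = rssm.imagine_step(latent_state, action)  # prior step, no observation
    rewards.append(reward_head(latent_state))               # predicted reward
    values.append(critic(latent_state))                     # value of the reached latent
# The critic regresses onto lambda-returns built from rewards and values;
# the actor is updated to maximize those returns.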
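The loop above ends by pointing at the lambda-return, so here is a minimal sketch of how that target could be computed for one imagined trajectory. The function name, argument names, and default gamma and lam values are assumptions for illustration; the recursion is the standard one, R_t = r_t + γ((1 − λ)V_{t+1} + λR_{t+1}), bootstrapped from the critic at the final step. The optional continues argument shows where a continue head would scale the discount.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95, continues=None):
    """Compute lambda-returns for one imagined trajectory.

    rewards[t]     : predicted reward for imagined step t
    next_values[t] : critic value of the latent reached after step t
    continues[t]   : optional predicted continue probability for step t
    """
    horizon = len(rewards)
    if continues is None:
        continues = [1.0] * horizon
    returns = [0.0] * horizon
    # Bootstrap from the critic at the end of the imagined rollout.
    running = next_values[-1]
    for t in reversed(range(horizon)):
        discount = gamma * continues[t]
        # Mix the one-step target with the longer return computed so far.
        running = rewards[t] + discount * ((1.0 - lam) * next_values[t] + lam * running)
        returns[t] = running
    return returns


# lam=0 reduces to one-step TD targets; lam=1 to discounted reward sums plus bootstrap.
print(lambda_returns([1.0, 1.0, 1.0], [2.0, 4.0, 6.0], gamma=0.5, lam=0.0))
print(lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
```

Intermediate values of lam trade the bias of trusting the critic early against the variance and model error of trusting long imagined reward sequences.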

Worked example: sample efficiency intuition

On DeepMind Control cheetah run from pixels:

Agent	Steps to threshold return	Notes
PPO (pixels)	~10M+	Model-free baseline
DreamerV3	~1M, often competitive	Imagined actor–critic
SAC (state)	Fewer, given proprioceptive state	Not a fair comparison with pixel agents

World models trade more compute per step (training the model and imagining rollouts) for fewer environment steps, which pays off when each real step is expensive or dangerous.

Checkpoint: If the decoder reconstructs well but the cheetah falls in reality, which component failed?

Answer

Likely dynamics or reward head in latent space — reconstruction can look good while z does not encode physics-critical features (velocity, contact). Or the actor overfits imagined rollouts that diverge from real latents. Check imagined vs real latent distributions and reward prediction error.
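The diagnostic suggested in the answer can be sketched as two small helpers. This is a hypothetical illustration: open_loop_errors and mean_abs_reward_error are made-up names, the numbers are invented, and it assumes the real latents (encoded from observations) and the imagined latents (rolled forward from the same start state with the same actions) live in the same space and are aligned step by step.

```python
import math


def open_loop_errors(real_latents, imagined_latents):
    """Euclidean distance between real and imagined latents at each step."""
    errors = []
    for real, imagined in zip(real_latents, imagined_latents):
        squared = sum((r - i) ** 2 for r, i in zip(real, imagined))
        errors.append(math.sqrt(squared))
    return errors


def mean_abs_reward_error(real_rewards, predicted_rewards):
    """Average absolute gap between observed and predicted rewards."""
    pairs = list(zip(real_rewards, predicted_rewards))
    return sum(abs(r - p) for r, p in pairs) / len(pairs)


# Made-up numbers for illustration only.
real = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
imagined = [[0.0, 0.0], [1.0, 0.5], [2.0, 2.0]]
print(open_loop_errors(real, imagined))
print(mean_abs_reward_error([1.0, 0.0, 1.0], [1.0, 0.5, 0.0]))
```

An error curve that rises quickly with the step index points at the dynamics; a flat latent error with a large reward error points at the reward head instead.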

Training objectives (simplified)

Loss	Purpose
Reconstruction	Penalize the error ô − o so latents retain observation content
Reward prediction	Align with task signal
KL(posterior ‖ prior)	Train the prior to predict the posterior; regularize stochastic z
Continue / discount	Model episode termination
Actor–critic on dreams	Task performance without env

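Balancing the KL term is delicate: too weak, and the prior is a poor predictor of the posterior, so imagined rollouts drift; too strong, and the posterior collapses toward the prior, so latents ignore the observations.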
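One way to make this trade-off less brittle is to clip the KL from below (free bits) and weight the prior-training and posterior-regularizing roles separately, as the DreamerV3 description does. The sketch below computes only the loss value for a single categorical latent; the function names and the scale values are illustrative assumptions, and plain Python has no stop-gradient, so the comments mark where one would go in an autodiff framework.

```python
import math


def categorical_kl(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def kl_loss(posterior, prior, free_nats=1.0, dyn_scale=0.5, rep_scale=0.1):
    """Weighted KL loss with free bits (scales are illustrative assumptions)."""
    kl = categorical_kl(posterior, prior)
    # Free bits: below the threshold the term is constant, so it gives no gradient.
    clipped = max(kl, free_nats)
    # In an autodiff framework the two terms differ only in stop-gradient placement.
    dyn_term = dyn_scale * clipped  # stop-gradient on posterior: trains the prior
    rep_term = rep_scale * clipped  # stop-gradient on prior: regularizes the posterior
    return dyn_term + rep_term


matched = kl_loss([0.5, 0.5], [0.5, 0.5])
mismatched = kl_loss([0.99, 0.01], [0.01, 0.99])
print(matched, mismatched)
```

When posterior and prior already agree, the loss sits at its floor and stops pushing the latents, which leaves capacity for the reconstruction and reward terms.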

Dreamer family timeline

Version	Notable improvement
DreamerV1	Latent imagination for control
DreamerV2	Discrete (categorical) latents and KL balancing; more stable training
DreamerV3	Single config across 150+ tasks

Open implementations exist (e.g. dreamerv3 repos); when reading one, look first at the imagination horizon H and at how updates are split between the world model and the actor.

Limits and open problems

Compounding error in imagination — same issue as tabular models, worse in pixels.
Stochastic environments — need accurate uncertainty; else actor exploits model noise.
Real robots — world model must capture contacts; sim pre-training common.
Compute — wall-clock may exceed simple PPO unless env is slow.
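The compounding-error limit in the list above is easy to see in a toy setting. The sketch below is not a Dreamer experiment: it assumes a scalar linear system whose true gain is 1.0 and a learned model whose gain is off by 2%, both values chosen only for illustration, and measures how far an imagined rollout drifts from the true one as the horizon grows.

```python
def rollout(x0, gain, horizon):
    """Roll a scalar linear system x_{t+1} = gain * x_t forward for `horizon` steps."""
    xs, x = [], x0
    for _ in range(horizon):
        x = gain * x
        xs.append(x)
    return xs


def imagination_errors(x0, true_gain, model_gain, horizon):
    """Absolute gap between true and model rollouts at each step."""
    true_xs = rollout(x0, true_gain, horizon)
    model_xs = rollout(x0, model_gain, horizon)
    return [abs(t - m) for t, m in zip(true_xs, model_xs)]


# A 2% one-step error grows with every imagined step.
errors = imagination_errors(x0=1.0, true_gain=1.0, model_gain=1.02, horizon=15)
print([round(e, 3) for e in errors])
```

The per-step error is small, yet the gap widens at every step because each prediction is fed back as the next input. This is the reason imagination horizons are kept short and the critic is used to bootstrap beyond them.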

World models also enable counterfactual reasoning ("what if I had braked earlier?") for safety analysis — touched in Module 9.

Common mistakes

Mistake	Symptom	Fix
Training actor only on real data	Loses sample-efficiency benefit	Imagination rollouts
Very long H with weak dynamics	Actor exploits model errors	Shorter H; retune KL weight; model ensemble
Skipping continue head	Imagined episodes never end	Predict discount / done
Evaluating on train env only	Overfit visuals	New seeds, backgrounds
Comparing pixel-based Dreamer with state-based SAC	Misleading baseline	Match observation type

Closing

Dreamer shows that model-based RL at scale means learning to dream in latent space, then optimizing behavior inside those dreams. You now have the full Module 7 arc: explicit models + Dyna, tree search, and latent world models. Pick the planner that matches your action space, state representation, and cost of real interaction.
