Module 7 — Model-based RL

Planning with learned models

Model-based vs model-free trade-offs, rollout planning, and compounding error.

~60 min read + exercises

Before we begin

Model-free RL learns values or policies directly from environment interaction. Model-based RL adds a learned or given dynamics model, a function that predicts what happens next, and uses that model to plan before acting. The promise: fewer real-world samples (better sample efficiency) and explicit reasoning about consequences.

Dynamics model — predicts next state and reward from current state and action: (s, a) → s′, r.
Planning — searching or simulating forward in the model to choose a better action now.
Model-free — no explicit transition model; learns Q or π from experience only.


What you will learn

  • Distinguish model-based vs model-free RL and when each shines.
  • Define one-step, multi-step, and latent dynamics models.
  • Explain planning with a learned model via rollouts, tree search, or MPC.
  • Recognize model bias — wrong models steer the agent into imaginary success.
  • Sketch how learned models connect to Dyna-Q, MCTS, and Dreamer later in this module.

The model-based vs model-free split

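| Approach | What you learn | Planning? | Sample efficiency |
|---|---|---|---|
| Model-free (Q-learning, PPO) | Q(s, a) or π(a ∣ s) | Implicit in backups | Typically lower |
| Model-based | P(s′ ∣ s, a), R(s, a), or a simulator | Explicit rollouts / search | Typically higher when the model is accurate |
| Hybrid (Dyna, MuZero) | Both model and value/policy | Real + imagined data | Best of both |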

Model-free is robust when the environment is hard to model (contact-rich robotics, human opponents) but can waste samples. Model-based shines when dynamics are smooth and learnable (pendulum, driving lanes) or when real interaction is expensive (hardware wear, clinical trials).

What counts as a model?

  1. Tabular transitions — store counts for each (s, a, s′) pair.
  2. Learned neural model — f_θ(s, a) → (s′, r).
  3. Physics engine — MuJoCo, Isaac Sim; parameters may be randomized.
  4. Latent model — predict in compressed representation z instead of raw pixels.
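
The first option in the list above, a tabular model, fits in a few lines. This is a minimal sketch, assuming small discrete state and action spaces; the class name `TabularModel` and its method names are hypothetical and not taken from any library.

```python
from collections import defaultdict

class TabularModel:
    """Count-based transition and reward model for a small discrete MDP (illustrative)."""

    def __init__(self):
        # counts[(s, a)][s_next] = number of times s_next followed (s, a)
        self.counts = defaultdict(lambda: defaultdict(int))
        # reward_sum[(s, a)] = total reward observed after taking a in s
        self.reward_sum = defaultdict(float)

    def update(self, s, a, r, s_next):
        """Record one real transition."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def transition_probs(self, s, a):
        """Maximum-likelihood estimate of P(s' | s, a); empty dict if (s, a) is unseen."""
        total = sum(self.counts[(s, a)].values())
        if total == 0:
            return {}
        return {s2: n / total for s2, n in self.counts[(s, a)].items()}

    def expected_reward(self, s, a):
        """Average observed reward for (s, a); 0.0 if (s, a) is unseen."""
        total = sum(self.counts[(s, a)].values())
        return self.reward_sum[(s, a)] / total if total else 0.0

# Four made-up transitions from state 0 under action "right".
model = TabularModel()
for s_next in [1, 1, 1, 2]:
    model.update(0, "right", 1.0 if s_next == 1 else 0.0, s_next)
probs = model.transition_probs(0, "right")
```

The empty result for unseen pairs is the tabular version of a problem that returns later in this lesson: the model has nothing reliable to say outside the data it was fit on.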

Learning a one-step dynamics model

Given a replay buffer of transitions (s, a, r, s′), train a network to minimize prediction error:

python
import torch
import torch.nn as nn
 
class OneStepModel(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256),
            nn.ReLU(),
            nn.Linear(256, obs_dim + 1),  # next_obs delta + reward
        )
 
    def forward(self, obs, action):
        x = torch.cat([obs, action], dim=-1)
        out = self.net(x)
        delta_obs = out[..., :-1]
        reward = out[..., -1]
        return obs + delta_obs, reward  # predict residual for stability

Residual prediction (predict Δs instead of s′) often trains more stably for continuous control. Stochastic models output a distribution over s′ to capture aleatoric uncertainty.
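
To make "minimize prediction error" concrete, here is a hedged training sketch. It assumes a synthetic replay buffer from a made-up 1D system (s′ = s + 0.1a, r = −|s′|), a mean-squared-error loss on both next state and reward, and Adam with a learning rate of 1e-3; none of these choices come from the lesson. The small network below mirrors the residual layout of `OneStepModel` so that the block runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic replay buffer (assumption): s' = s + 0.1 * a, r = -|s'|.
obs = torch.rand(512, 1) * 2 - 1
act = torch.rand(512, 1) * 2 - 1
next_obs = obs + 0.1 * act
rew = -next_obs.abs().squeeze(-1)

# Outputs [delta_obs, reward], the same layout as OneStepModel above.
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def model_loss(o, a, o_next, r):
    """MSE on the predicted next state plus MSE on the predicted reward."""
    out = net(torch.cat([o, a], dim=-1))
    pred_next = o + out[..., :-1]   # residual prediction
    pred_rew = out[..., -1]
    return nn.functional.mse_loss(pred_next, o_next) + nn.functional.mse_loss(pred_rew, r)

initial_loss = model_loss(obs, act, next_obs, rew).item()
for step in range(500):
    idx = torch.randint(0, obs.shape[0], (64,))  # random minibatch from the buffer
    loss = model_loss(obs[idx], act[idx], next_obs[idx], rew[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
final_loss = model_loss(obs, act, next_obs, rew).item()
```

Summing the two losses with equal weight is a simplification; in practice the state and reward terms often need separate scaling.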

Worked example: 1D point mass

State = position x, action = force u. True dynamics: x′ = x + 0.1u.
After 1000 random transitions, a linear model fits nearly perfectly. Planning with this model (try u ∈ {-1, 0, 1} and pick the u that minimizes the predicted distance to the goal) moves the mass toward the target with no further real interaction, and reaches a goal 0.1 away in a single step.
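
The worked example can be reproduced in a few lines. This sketch assumes noise-free transitions, positions and forces sampled uniformly from [-1, 1], and a least-squares fit of x′ ≈ a·x + b·u; the helper names `true_step` and `plan` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(x, u):
    """Ground-truth dynamics from the example: x' = x + 0.1 * u."""
    return x + 0.1 * u

# Collect 1000 random transitions.
x = rng.uniform(-1, 1, size=1000)
u = rng.uniform(-1, 1, size=1000)
x_next = true_step(x, u)

# Fit the linear model x' ~ a * x + b * u by least squares.
features = np.stack([x, u], axis=1)
coef, *_ = np.linalg.lstsq(features, x_next, rcond=None)
a_hat, b_hat = coef

def plan(x_now, goal, candidates=(-1.0, 0.0, 1.0)):
    """Pick the candidate force whose predicted next position is closest to the goal."""
    errors = [abs(a_hat * x_now + b_hat * c - goal) for c in candidates]
    return candidates[int(np.argmin(errors))]

u_star = plan(0.0, goal=0.1)
x_after = true_step(0.0, u_star)  # one real step with the planned force
```

Because the data here is noise-free and the true dynamics are linear, the fit recovers a ≈ 1 and b ≈ 0.1 almost exactly; real data would not be this kind.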

Checkpoint: If the learned model predicts x′ = x + 0.05u instead of 0.1u, what happens to planning?

Answer

The planner underestimates the effect of actions. It will choose larger or repeated forces that look optimal in the model but can overshoot or oscillate around the goal in reality, because each real action moves the mass twice as far as predicted. This is classic model bias. The agent may also avoid actions the model extrapolates poorly.
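
A small simulation makes the checkpoint tangible. Note the assumptions: unlike the three discrete forces in the worked example, this sketch lets the planner choose any force in [-1, 1] by solving the model for the force that should land on the goal, and the goal (0.05) and start position (0) are made up. The function name `run_closed_loop` is illustrative.

```python
def run_closed_loop(model_gain, x0=0.0, goal=0.05, steps=6, true_gain=0.1):
    """Plan with x' = x + model_gain * u, but act in the true system x' = x + true_gain * u."""
    x, errors = x0, []
    for _ in range(steps):
        # One-step planner: force that reaches the goal according to the model, clipped to [-1, 1].
        u = max(-1.0, min(1.0, (goal - x) / model_gain))
        x = x + true_gain * u            # real transition
        errors.append(abs(goal - x))     # real distance to goal after the step
    return errors

errors_correct = run_closed_loop(model_gain=0.1)   # accurate model
errors_biased = run_closed_loop(model_gain=0.05)   # model underestimates the force effect
```

With the accurate model the error is zero after the first step. With the biased model every planned force is twice as large as it should be, so the mass jumps from one side of the goal to the other and the error stays at 0.05 no matter how often the agent replans.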


Planning with rollouts

Random shooting: sample N action sequences of length H, simulate each in the model, score total return, execute the first action of the best sequence. Cross-entropy method (CEM) iteratively refits a distribution over action sequences toward high-return samples.

python
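import numpy as np

def plan_action(model, state, act_dim, horizon=10, num_samples=200, gamma=0.99):
    """Random shooting: return the first action of the best sampled sequence.

    Assumes model.predict(state, action) returns (next_state, reward).
    """
    best_return, best_first_action = -np.inf, None
    for _ in range(num_samples):
        # Sample one candidate action sequence uniformly from [-1, 1].
        actions = np.random.uniform(-1, 1, size=(horizon, act_dim))
        s, G = np.array(state, dtype=float), 0.0
        for t in range(horizon):
            s, r = model.predict(s, actions[t])  # imagined step, no real interaction
            G += (gamma ** t) * r                # discounted imagined return
        if G > best_return:
            best_return, best_first_action = G, actions[0]
    return best_first_action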

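Calling plan_action from the current real state at every step and executing only the returned first action is model predictive control (MPC). Short horizons limit compounding error, and replanning every step corrects drift because each plan starts from an observed state rather than an imagined one.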
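
The replanning loop itself is short. The sketch below is an illustration under stated assumptions: it reuses the 1D point mass, gives the planner a deliberately wrong gain (0.08 instead of 0.1), scores sequences by the summed negative distance to a made-up goal of 0.5, and uses a vectorized random-shooting planner named `shoot_first_action` so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_GAIN, MODEL_GAIN, GOAL = 0.1, 0.08, 0.5  # the model is deliberately 20% off

def shoot_first_action(x, horizon=5, num_samples=256):
    """Random shooting in the imperfect model; returns the first action of the best sequence."""
    seqs = rng.uniform(-1, 1, size=(num_samples, horizon))
    # Imagined position after each step: x + MODEL_GAIN * cumulative force.
    imagined = x + MODEL_GAIN * np.cumsum(seqs, axis=1)
    returns = -np.abs(imagined - GOAL).sum(axis=1)
    return seqs[np.argmax(returns), 0]

x = 0.0
for t in range(30):
    u = shoot_first_action(x)   # plan in the model, starting from the real state
    x = x + TRUE_GAIN * u       # execute only the first action in the real system
final_error = abs(x - GOAL)
```

Each plan is wrong about how far the mass will move, but because the next plan starts from the observed position, the mistake does not accumulate across steps.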

| Planning method | Idea | Cost |
|---|---|---|
| Brute-force rollouts | Sample action sequences | Grows with horizon × samples |
| MPC / CEM | Refine action distribution | Moderate; popular in robotics |
| Tree search | Branch on actions (next lesson) | High for large branching |
| Value + model (Dyna) | Model generates TD targets | Low per step |
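
The "refine action distribution" idea behind CEM can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a scalar action, an independent Gaussian per time step, no discounting, and hypothetical callables `step_fn` and `reward_fn` standing in for the learned model.

```python
import numpy as np

def cem_plan(step_fn, reward_fn, state, horizon=5, pop=64, n_elite=8, iters=5, seed=0):
    """Cross-entropy method over action sequences; returns the final mean sequence."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        # Sample candidate sequences and clip them to the action range [-1, 1].
        seqs = np.clip(rng.normal(mean, std, size=(pop, horizon)), -1.0, 1.0)
        returns = np.empty(pop)
        for i, seq in enumerate(seqs):
            s, G = state, 0.0
            for u in seq:
                s = step_fn(s, u)      # imagined transition
                G += reward_fn(s)      # undiscounted imagined return
            returns[i] = G
        # Refit the sampling distribution to the highest-return sequences.
        elite = seqs[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

goal = 0.3
plan = cem_plan(lambda x, u: x + 0.1 * u, lambda x: -abs(x - goal), state=0.0)
```

Compared with random shooting, each iteration concentrates samples where earlier samples scored well, so fewer rollouts are wasted on clearly bad sequences. In an MPC loop only `plan[0]` would be executed before replanning.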

Compounding error and model bias

Errors in one-step prediction compound over multi-step rollouts. A 5% per-step error over 50 steps can place imagined states far from any real state the model was trained on — out-of-distribution inputs produce garbage predictions.
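
A quick calculation shows how fast this can get out of hand. The sketch assumes the simplest case in which a per-step error compounds multiplicatively: a true system that holds its state (gain 1.00) and a model that overestimates each step by 5% (gain 1.05). The function name `rollout` is illustrative.

```python
def rollout(gain, x0=1.0, steps=50):
    """Open-loop rollout of x' = gain * x, returning every visited state."""
    xs = [x0]
    for _ in range(steps):
        xs.append(gain * xs[-1])
    return xs

true_traj = rollout(gain=1.00)    # true system: the state stays at 1.0
model_traj = rollout(gain=1.05)   # model: 5% too large at every step
ratio = model_traj[-1] / true_traj[-1]   # equals 1.05 ** 50
```

After 50 imagined steps the model's state is more than 11 times the true one. Not every error compounds this way (an error in how strongly actions act can grow only linearly), but any error that feeds back through the state behaves like this.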

Mitigations:

  • Use short horizons and replan every step.
  • Ensemble models: disagreement across ensemble members signals uncertainty, so the planner can avoid risky imagined paths.
  • Penalize uncertainty in the planning objective.
  • Mix real and imagined data (Dyna-Q, Dreamer) so the policy does not overfit to fantasy.
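
The ensemble and uncertainty-penalty ideas above combine naturally. In this hedged sketch the "ensemble" is just five hand-picked gains for the point-mass model x′ = x + g·u, standing in for five independently trained networks; the penalty is the summed standard deviation of imagined positions across members, which is one common choice among several. The function name `score_sequence` is illustrative.

```python
import numpy as np

# Hypothetical ensemble: each member learned a slightly different gain.
ensemble_gains = np.array([0.09, 0.10, 0.11, 0.10, 0.12])

def score_sequence(x0, actions, goal, penalty_weight=1.0):
    """Mean imagined return minus a penalty on ensemble disagreement."""
    # Shape (members, horizon): each member's imagined positions along the sequence.
    positions = x0 + ensemble_gains[:, None] * np.cumsum(actions)[None, :]
    returns = -np.abs(positions - goal).sum(axis=1)   # one return per member
    disagreement = positions.std(axis=0).sum()        # spread across members
    return returns.mean() - penalty_weight * disagreement, disagreement

gentle = np.full(5, 0.2)       # small forces
aggressive = np.full(5, 1.0)   # maximum forces
_, d_gentle = score_sequence(0.0, gentle, goal=0.1)
_, d_aggressive = score_sequence(0.0, aggressive, goal=0.1)
```

The aggressive sequence produces five times the disagreement of the gentle one, because the members differ in how strongly they think forces act. A planner using the penalized score is steered away from sequences whose outcome the ensemble cannot agree on.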

Common mistakes

| Mistake | Symptom | Fix |
|---|---|---|
| Planning too far ahead with a weak model | Great simulated return, poor real return | Shorter horizon, ensembles, MPC |
| No normalization on states | Model loss diverges | Match train stats; normalize s, a |
| Deterministic model for stochastic env | Overconfident planner | Stochastic / probabilistic dynamics |
| Ignoring reward model error | Wrong trade-offs near goals | Jointly train r and s′; validate on holdout |
| No real data refresh | Policy exploits model holes | Continue collecting real transitions |
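
The normalization fix in the table is easy to get subtly wrong, so here is a small sketch. It assumes statistics are computed once from the training buffer and then reused unchanged at planning time; the class name `Normalizer` and the synthetic two-dimensional states are made up for illustration.

```python
import numpy as np

class Normalizer:
    """Stores per-dimension mean and std from training data and applies them consistently."""

    def __init__(self, data, eps=1e-8):
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + eps   # eps guards against zero-variance dimensions

    def normalize(self, x):
        return (x - self.mean) / self.std

    def denormalize(self, z):
        return z * self.std + self.mean

rng = np.random.default_rng(0)
# Two state dimensions on very different scales (for example, an angle and a position in mm).
states = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(1000, 2))
norm = Normalizer(states)
z = norm.normalize(states)
```

The model would be trained on `z` and its predictions passed through `denormalize` before the planner scores them. Recomputing the statistics on a different batch at planning time is the mismatch the table warns about.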

Closing

Planning with learned models turns RL into learn + imagine + decide. The model is never perfect; production systems treat it as a sample multiplier and a what-if engine, not a replacement for grounding in real feedback. The next lessons add Dyna-Q (tabular hybrid), MCTS (tree search), and Dreamer (latent world models).

