Planning with learned models

Before we begin

Model-free RL learns values or policies directly from environment interaction. Model-based RL adds a dynamics model, learned or given, that predicts what happens next, and uses that model to plan before acting. The promise: fewer real-world samples and explicit reasoning about consequences.

Dynamics model — predicts next state and reward from current state and action: (s, a) → s′, r.
Planning — searching or simulating forward in the model to choose a better action now.
Model-free — no explicit transition model; learns Q or π from experience only.

What you will learn

Distinguish model-based vs model-free RL and when each shines.
Define one-step, multi-step, and latent dynamics models.
Explain planning with a learned model via rollouts, tree search, or MPC.
Recognize model bias — wrong models steer the agent into imaginary success.
Sketch how learned models connect to Dyna-Q, MCTS, and Dreamer later in this module.

The model-based vs model-free split

Approach	What you learn	Planning?	Sample efficiency
Model-free (Q-learning, PPO)	Q(s,a) or π(a|s)	Implicit in backups	Lower; can waste samples
Model-based	P(s′|s,a), R(s,a) or a simulator	Explicit rollouts / search	Higher; fewer real-world samples
Hybrid (Dyna, MuZero)	Both model and value/policy	Real + imagined data	Best of both

Model-free is robust when the environment is hard to model (contact-rich robotics, human opponents) but can waste samples. Model-based shines when dynamics are smooth and learnable (pendulum, driving lanes) or when real interaction is expensive (hardware wear, clinical trials).

What counts as a model?

Tabular transitions: store counts for each (s, a, s′) triple.
Learned neural model: f_θ(s, a) → (s′, r).
Physics engine: MuJoCo, Isaac Sim; parameters may be randomized.
Latent model: predict in a compressed representation z instead of raw pixels.
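
As a minimal sketch of the tabular case, stored counts can be normalized into an empirical estimate of P(s′|s,a). The class name `TabularModel`, its methods, and the toy state labels are illustrative assumptions, not part of any library.

```python
from collections import defaultdict


class TabularModel:
    """Count-based transition model (hypothetical helper for illustration)."""

    def __init__(self):
        # counts[(s, a)][s_next] = number of times s_next followed (s, a)
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1

    def prob(self, s, a, s_next):
        """Empirical estimate of P(s_next | s, a); 0.0 if (s, a) is unseen."""
        seen = self.counts.get((s, a))
        if not seen:
            return 0.0
        return seen.get(s_next, 0) / sum(seen.values())


model = TabularModel()
for s_next in ["B", "B", "B", "C"]:
    model.update("A", "right", s_next)

p_b = model.prob("A", "right", "B")  # 3 of the 4 observed transitions
p_c = model.prob("A", "right", "C")  # 1 of the 4 observed transitions
```

Returning 0.0 for unseen state-action pairs is one choice among several; a planner may prefer to treat unseen pairs as uncertain rather than impossible.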

Learning a one-step dynamics model

Given a replay buffer of transitions (s, a, r, s′), train a network to minimize prediction error:

python

import torch
import torch.nn as nn

class OneStepModel(nn.Module):
    """One-step dynamics model: (obs, action) -> (next_obs, reward)."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256),
            nn.ReLU(),
            nn.Linear(256, obs_dim + 1),  # outputs: next_obs delta + reward
        )

    def forward(self, obs, action):
        x = torch.cat([obs, action], dim=-1)
        out = self.net(x)
        delta_obs = out[..., :-1]  # first obs_dim outputs: predicted change in obs
        reward = out[..., -1]      # last output: predicted scalar reward
        return obs + delta_obs, reward  # predict residual for stability

Residual prediction (predict Δs instead of s′) often trains more stably for continuous control. Stochastic models output a distribution over s′ to capture aleatoric uncertainty.
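
One way the training step might look is sketched below, assuming a mean-squared-error loss on both the next state and the reward. The synthetic buffer, the toy dynamics, the smaller network, and the hyperparameters are all assumptions chosen so the block runs on its own; they are not tuned values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, act_dim = 3, 1

# Compact stand-in for a one-step model: maps (s, a) to [delta_s, r].
net = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 64),
    nn.ReLU(),
    nn.Linear(64, obs_dim + 1),
)
optimizer = torch.optim.Adam(net.parameters(), lr=3e-3)

# Synthetic "replay buffer" (an assumption for this sketch): toy linear dynamics.
obs = torch.randn(512, obs_dim)
act = torch.rand(512, act_dim) * 2 - 1
next_obs = obs + 0.1 * act           # every state dimension shifts by 0.1 * a
rew = -next_obs.pow(2).sum(dim=-1)   # reward: negative squared distance to origin


def model_loss(b_obs, b_act, b_next_obs, b_rew):
    out = net(torch.cat([b_obs, b_act], dim=-1))
    pred_next_obs = b_obs + out[..., :-1]  # residual prediction
    pred_rew = out[..., -1]
    return (nn.functional.mse_loss(pred_next_obs, b_next_obs)
            + nn.functional.mse_loss(pred_rew, b_rew))


with torch.no_grad():
    initial_loss = model_loss(obs, act, next_obs, rew).item()

for step in range(300):
    idx = torch.randint(0, obs.shape[0], (64,))  # sample a minibatch
    loss = model_loss(obs[idx], act[idx], next_obs[idx], rew[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    final_loss = model_loss(obs, act, next_obs, rew).item()
```

Summing the two losses with equal weight is a simplification; in practice the state and reward terms often need separate scaling.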

Worked example: 1D point mass

State = position x, action = force u. True dynamics: x′ = x + 0.1u.
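After 1000 random transitions, a linear model fits nearly perfectly. Planning with this model (try u ∈ {-1, 0, 1}, pick the u that minimizes predicted distance to the goal) moves the point mass up to 0.1 toward the target per step, so a goal within 0.1 is reached in one step, and the planner needs no further real interaction to choose actions.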
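
The worked example can be sketched in plain Python. The least-squares fit of a single gain `k_hat`, the sampling ranges, and the goal position 0.3 are assumptions made for illustration.

```python
import random

random.seed(0)
TRUE_GAIN = 0.1  # true dynamics from the text: x' = x + 0.1 * u

# Collect 1000 random transitions (x, u, x').
data = []
for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    u = random.uniform(-1.0, 1.0)
    data.append((x, u, x + TRUE_GAIN * u))

# Least-squares fit of k in the residual model x' - x = k * u.
num = sum(u * (x_next - x) for x, u, x_next in data)
den = sum(u * u for _, u, _ in data)
k_hat = num / den


def plan(x, goal, gain, candidates=(-1.0, 0.0, 1.0)):
    """Pick the candidate force whose predicted next position is closest to goal."""
    return min(candidates, key=lambda u: abs(x + gain * u - goal))


x, goal = 0.0, 0.3
for _ in range(3):
    u = plan(x, goal, k_hat)
    x = x + TRUE_GAIN * u  # apply the chosen action to the "real" system
```

Because the data here are noise-free, the fitted gain matches the true gain up to floating-point error, and three planned steps of 0.1 carry the mass from 0.0 to the goal at 0.3.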

Checkpoint: If the learned model predicts x′ = x + 0.05u instead of 0.1u, what happens to planning?

Answer

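The planner underestimates the effect of actions, so it chooses forces that are too large or repeats them too often. These look optimal in the model, but in reality each force moves the mass twice as far as predicted and the agent overshoots the target. This is classic model bias. More generally, a biased model can also steer the agent away from actions whose outcomes it extrapolates poorly.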
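
The checkpoint can be simulated directly. This sketch assumes a continuous force clipped to [-1, 1] (rather than the three discrete choices above) and a goal at 0.05; the function names are hypothetical.

```python
def clip(v, lo=-1.0, hi=1.0):
    return max(lo, min(hi, v))


def run(model_gain, true_gain=0.1, x=0.0, goal=0.05, steps=4):
    """Replan every step with the model; apply each action to the true system."""
    trajectory = [x]
    for _ in range(steps):
        # One-step planner: solve x + model_gain * u = goal for u, then clip.
        u = clip((goal - x) / model_gain)
        x = x + true_gain * u
        trajectory.append(x)
    return trajectory


accurate = run(model_gain=0.1)  # model matches reality
biased = run(model_gain=0.05)   # model underestimates the effect of u
```

With the accurate model the mass settles at the goal after one step. With the biased model the planner asks for twice the needed force, so the mass jumps past the goal and then back again on the next replan, never settling.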

Planning with rollouts

Random shooting: sample N action sequences of length H, simulate each in the model, score total return, execute the first action of the best sequence. Cross-entropy method (CEM) iteratively refits a distribution over action sequences toward high-return samples.

python

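import numpy as np

def plan_action(model, state, act_dim, horizon=10, num_samples=200, gamma=0.99):
    """Random shooting: return the first action of the best sampled sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(num_samples):
        # Sample one candidate action sequence, uniform in [-1, 1].
        actions = np.random.uniform(-1, 1, size=(horizon, act_dim))
        s, G = state.copy(), 0.0
        for t in range(horizon):
            # model.predict(s, a) is assumed to return (next_state, reward).
            s, r = model.predict(s, actions[t])
            G += (gamma ** t) * r
        if G > best_return:
            best_return, best_first_action = G, actions[0]
    return best_first_action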

This is model predictive control (MPC) when replanned every step. Short horizons limit compounding error; replanning every step corrects drift.
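
A minimal sketch of CEM inside an MPC loop, using the 1D point mass as the model. The Gaussian parameterization, the elite count, the iteration count, and the reward (negative distance to a goal at 0.5) are assumptions for illustration, not tuned values.

```python
import random
import statistics

random.seed(0)
GOAL = 0.5


def model_step(x, u):
    """Assumed learned model of the 1D point mass: x' = x + 0.1 * u."""
    x_next = x + 0.1 * u
    return x_next, -abs(x_next - GOAL)  # reward: negative distance to goal


def rollout_return(x, actions, gamma=0.99):
    G = 0.0
    for t, u in enumerate(actions):
        x, r = model_step(x, u)
        G += (gamma ** t) * r
    return G


def cem_plan(x, horizon=5, num_samples=64, num_elites=8, iterations=4):
    """Refit a per-step Gaussian over actions toward the highest-return samples."""
    means = [0.0] * horizon
    stds = [1.0] * horizon
    for _ in range(iterations):
        samples = [
            [max(-1.0, min(1.0, random.gauss(m, s))) for m, s in zip(means, stds)]
            for _ in range(num_samples)
        ]
        samples.sort(key=lambda seq: rollout_return(x, seq), reverse=True)
        elites = samples[:num_elites]
        for t in range(horizon):
            column = [seq[t] for seq in elites]
            means[t] = statistics.mean(column)
            stds[t] = statistics.pstdev(column) + 1e-3  # keep some exploration
    return means[0]  # MPC: execute only the first action


# MPC loop: replan from the current state at every step.
x = 0.0
for _ in range(10):
    u = cem_plan(x)
    x, _ = model_step(x, u)  # here the model doubles as the "real" system
```

In a real setup the last line would step the actual environment instead of the model, which is what lets replanning correct for model drift.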

Planning method	Idea	Cost
Brute-force rollouts	Sample action sequences	Grows with horizon × samples
MPC / CEM	Refine action distribution	Moderate; popular in robotics
Tree search	Branch on actions (next lesson)	High for large branching
Value + model (Dyna)	Model generates TD targets	Low per step

Compounding error and model bias

Errors in one-step prediction compound over multi-step rollouts. A 5% per-step error over 50 steps can place imagined states far from any real state the model was trained on — out-of-distribution inputs produce garbage predictions.
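
To make the arithmetic concrete, the sketch below assumes the simplest possible case: a scalar system where the model's per-step error is multiplicative. Real systems compound error in more complicated ways, so treat the numbers as an illustration only.

```python
def rollout(x0, a, steps):
    """Iterate x' = a * x and return the visited states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1])
    return xs


true_traj = rollout(1.0, a=1.00, steps=50)   # true system holds x constant
model_traj = rollout(1.0, a=1.05, steps=50)  # model is 5% off at every step

ratio_after_1 = model_traj[1] / true_traj[1]     # 1.05
ratio_after_50 = model_traj[50] / true_traj[50]  # 1.05 ** 50, roughly 11.5
```

A prediction that is 5% off after one step ends up more than ten times the true value after 50 steps under this assumption.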

Mitigations:

Use short horizons and replan every step.
Ensemble models: disagreement across members signals uncertainty, so the planner can avoid risky imagined paths.
Penalize uncertainty in the planning objective.
Mix real and imagined data (Dyna-Q, Dreamer) so the policy does not overfit to fantasy.
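
The ensemble and uncertainty-penalty ideas above can be combined in a few lines. This is a sketch under stated assumptions: the three ensemble gains, the goal, and the use of the standard deviation of imagined states as the disagreement measure are all hypothetical choices.

```python
import statistics

# Hypothetical ensemble: each member learned a slightly different gain
# for the point mass x' = x + gain * u.
ENSEMBLE_GAINS = [0.08, 0.10, 0.12]
GOAL = 0.5


def penalized_score(x, actions, penalty_weight=1.0):
    """Mean predicted return minus a penalty on ensemble disagreement."""
    states = [x] * len(ENSEMBLE_GAINS)  # one imagined state per member
    total_return, total_disagreement = 0.0, 0.0
    for u in actions:
        states = [s + g * u for s, g in zip(states, ENSEMBLE_GAINS)]
        rewards = [-abs(s - GOAL) for s in states]
        total_return += statistics.mean(rewards)
        total_disagreement += statistics.pstdev(states)
    return total_return - penalty_weight * total_disagreement


with_penalty = penalized_score(0.0, [1.0] * 5)
no_penalty = penalized_score(0.0, [1.0] * 5, penalty_weight=0.0)
idle = penalized_score(0.0, [0.0] * 3)  # members agree when nothing moves
```

A planner would use `penalized_score` in place of the raw return when ranking action sequences, so sequences that drive the members apart score lower.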

Common mistakes

Mistake	Symptom	Fix
Planning too far ahead with a weak model	Great simulated return, poor real return	Shorter horizon, ensembles, MPC
No normalization of inputs	Model loss diverges	Normalize s and a with training-set statistics; reuse the same statistics at planning time
Deterministic model for stochastic env	Overconfident planner	Stochastic / probabilistic dynamics
Ignoring reward model error	Wrong trade-offs near goals	Jointly train r and s′; validate on holdout
No real data refresh	Policy exploits model holes	Continue collecting real transitions
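
For the normalization row, a minimal sketch of fitting statistics on training data and reusing them later. The helper names and the sample positions are hypothetical.

```python
import statistics


def fit_normalizer(values):
    """Return (mean, std) computed from training data only."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, std


def normalize(value, mean, std):
    return (value - mean) / std


train_positions = [100.0, 102.0, 98.0, 104.0, 96.0]
mean, std = fit_normalizer(train_positions)

# Reuse the training statistics at planning time; do not refit on imagined states.
z = normalize(110.0, mean, std)
```

A large normalized value such as `z` here is also a cheap warning sign that a state lies outside the range the model was trained on.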

Closing

Planning with learned models turns RL into learn + imagine + decide. The model is never perfect; production systems treat it as a sample multiplier and a what-if engine, not a replacement for grounding in real feedback. The next lessons add Dyna-Q (tabular hybrid), MCTS (tree search), and Dreamer (latent world models).
