Planning with learned models
Before we begin
Model-free RL learns values or policies directly from environment interaction. Model-based RL adds a dynamics model, either learned or given, that predicts what happens next, and uses that model to plan before acting. The promise: better sample efficiency (fewer real-world interactions) and explicit reasoning about consequences.
Dynamics model — predicts next state and reward from current state and action: (s, a) → s′, r.
Planning — searching or simulating forward in the model to choose a better action now.
Model-free — no explicit transition model; learns Q or π from experience only.
What you will learn
- Distinguish model-based vs model-free RL and when each shines.
- Define one-step, multi-step, and latent dynamics models.
- Explain planning with a learned model via rollouts, tree search, or MPC.
- Recognize model bias — wrong models steer the agent into imaginary success.
- Sketch how learned models connect to Dyna-Q, MCTS, and Dreamer later in this module.
The model-based vs model-free split
| Approach | What you learn | Planning? | Sample efficiency |
|---|---|---|---|
| Model-free (Q-learning, PPO) | Q(s,a) or π(a \| s) | Implicit in backups | Lower; can waste samples |
| Model-based | P(s′ \| s,a), R(s,a) or a simulator | Explicit rollouts / search | Higher when the model is accurate |
| Hybrid (Dyna, MuZero) | Both model and value/policy | Real + imagined data | Best of both |
Model-free is robust when the environment is hard to model (contact-rich robotics, human opponents) but can waste samples. Model-based shines when dynamics are smooth and learnable (pendulum, driving lanes) or when real interaction is expensive (hardware wear, clinical trials).
What counts as a model?
- Tabular transitions — store counts for each (s, a, s′) pair.
- Learned neural model — f_θ(s, a) → (s′, r).
- Physics engine — MuJoCo, Isaac Sim; parameters may be randomized.
- Latent model — predict in compressed representation z instead of raw pixels.
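As a minimal sketch of the tabular case, the snippet below estimates P(s′ | s, a) from observed transition counts. The class name `TabularModel` and the toy transitions are illustrative assumptions, not part of any library.

```python
from collections import Counter, defaultdict


class TabularModel:
    """Count-based transition model: estimates P(s' | s, a) from observed triples."""

    def __init__(self):
        # counts[(s, a)][s_next] = number of times that transition was observed
        self.counts = defaultdict(Counter)

    def update(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1

    def prob(self, s, a, s_next):
        total = sum(self.counts[(s, a)].values())
        if total == 0:
            return 0.0  # unseen (s, a): no estimate available
        return self.counts[(s, a)][s_next] / total


# Toy data (hypothetical): "right" from state 0 succeeds two times out of three.
model = TabularModel()
for s, a, s_next in [(0, "right", 1), (0, "right", 1), (0, "right", 0), (1, "left", 0)]:
    model.update(s, a, s_next)

p_success = model.prob(0, "right", 1)
print(p_success)
```

Returning 0.0 for an unseen (s, a) is one simple choice; a uniform prior or optimistic initialization would be reasonable alternatives.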
Learning a one-step dynamics model
Given a replay buffer of transitions (s, a, r, s′), train a network to minimize prediction error:
```python
import torch
import torch.nn as nn


class OneStepModel(nn.Module):
    """Predicts the next observation and reward from (obs, action)."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256),
            nn.ReLU(),
            nn.Linear(256, obs_dim + 1),  # obs_dim delta values + 1 reward
        )

    def forward(self, obs, action):
        x = torch.cat([obs, action], dim=-1)
        out = self.net(x)
        delta_obs = out[..., :-1]  # predicted change in observation
        reward = out[..., -1]      # predicted scalar reward
        return obs + delta_obs, reward  # residual prediction: s' = s + delta
```

Residual prediction (predicting Δs instead of s′) often trains more stably for continuous control. Stochastic models output a distribution over s′ to capture aleatoric uncertainty.
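The text does not name a specific loss, so the following is one common choice rather than the definitive objective: mean squared error over the replay buffer, with the state and reward terms weighted equally (both assumptions). Here 𝒟 is the replay buffer, Δ_θ is the predicted observation delta (`delta_obs` in the code), and r̂_θ is the predicted reward.

$$
\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(s, a, r, s') \in \mathcal{D}} \Big( \big\| s + \Delta_\theta(s, a) - s' \big\|_2^2 + \big( \hat{r}_\theta(s, a) - r \big)^2 \Big)
$$

In practice the two terms are often rescaled so that neither dominates when rewards and observations have very different magnitudes.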
Worked example: 1D point mass
State = position x, action = force u. True dynamics: x′ = x + 0.1u.
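After 1000 random transitions, a linear model fits nearly perfectly. Planning with this model means trying each u ∈ {-1, 0, 1} and picking the one that minimizes the predicted distance to the goal. Because each step moves the point by at most 0.1, the planner reaches a goal 0.1 away in one step and farther goals by replanning every step, with no further real data needed to refine the model.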
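The worked example can be reproduced in a few lines. This is a sketch under stated assumptions: the model is fit as a single gain k in x′ − x = k·u by least squares, and the helper names (`true_step`, `plan`) and the goal of 0.3 are chosen for illustration.

```python
import random

random.seed(0)


def true_step(x, u):
    """True dynamics from the example: x' = x + 0.1 * u."""
    return x + 0.1 * u


# Collect 1000 random transitions (x, u, x').
data = []
for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    u = random.uniform(-1.0, 1.0)
    data.append((x, u, true_step(x, u)))

# Least-squares fit of the gain k in (x' - x) = k * u.
num = sum(u * (x_next - x) for x, u, x_next in data)
den = sum(u * u for _, u, _ in data)
k_hat = num / den


def plan(x, goal, k, candidates=(-1.0, 0.0, 1.0)):
    """Pick the action whose predicted next position is closest to the goal."""
    return min(candidates, key=lambda u: abs(x + k * u - goal))


# Replan every step in the real system, using only the learned gain.
x, goal = 0.0, 0.3
for _ in range(3):
    x = true_step(x, plan(x, goal, k_hat))

print(k_hat, x)
```

Because the data are noise-free, the fitted gain matches 0.1 up to floating-point error, and three replanned steps carry the point from 0.0 to the goal at 0.3.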
Checkpoint: If the learned model predicts x′ = x + 0.05u instead of 0.1u, what happens to planning?
Answer
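The planner underestimates the effect of each action, so it chooses larger or more prolonged forces than the task needs. In the real system those actions move the point twice as far as predicted, so the agent overshoots near the goal and may oscillate around it. The plan looks optimal inside the model but underperforms in reality: classic model bias. More generally, plans are least reliable for actions where the model extrapolates poorly.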
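To make the checkpoint concrete, the sketch below runs the same greedy planner with an accurate gain (0.1) and a biased gain (0.05) against the true dynamics. The goal of 0.04 and the helper names are illustrative assumptions, chosen so that the best real action is to stay put.

```python
def true_step(x, u):
    """True dynamics: x' = x + 0.1 * u."""
    return x + 0.1 * u


def plan(x, goal, k, candidates=(-1.0, 0.0, 1.0)):
    """Greedy one-step planner using a model with gain k."""
    return min(candidates, key=lambda u: abs(x + k * u - goal))


def run(k_model, x0=0.0, goal=0.04, steps=6):
    """Plan with the model, act in the real system, and record positions."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        x = true_step(x, plan(x, goal, k_model))
        trajectory.append(x)
    return trajectory


accurate = run(k_model=0.1)   # stays at 0.0, which is 0.04 from the goal
biased = run(k_model=0.05)    # expects small moves, gets large ones

print(accurate)
print(biased)
```

With the accurate model the planner correctly holds still. With the biased model it expects to land within 0.01 of the goal, actually lands 0.06 away, then reverses, so the position bounces between 0.0 and 0.1 indefinitely.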
Planning with rollouts
Random shooting: sample N action sequences of length H, simulate each in the model, score total return, execute the first action of the best sequence. Cross-entropy method (CEM) iteratively refits a distribution over action sequences toward high-return samples.
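```python
import numpy as np


def plan_action(model, state, act_dim, horizon=10, num_samples=200, gamma=0.99):
    """Random shooting: return the first action of the best sampled sequence.

    Assumes `model.predict(s, a)` returns `(next_state, reward)` and that
    actions lie in [-1, 1].
    """
    best_return, best_first_action = -np.inf, None
    for _ in range(num_samples):
        actions = np.random.uniform(-1, 1, size=(horizon, act_dim))
        s, G = state.copy(), 0.0
        for t in range(horizon):
            s, r = model.predict(s, actions[t])
            G += (gamma ** t) * r  # discounted imagined return
        if G > best_return:
            best_return, best_first_action = G, actions[0]
    return best_first_action
```

Calling this planner at every step and executing only the first action is model predictive control (MPC). Short horizons limit compounding error; replanning every step corrects drift.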
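The cross-entropy method mentioned above can be sketched with the standard library alone. Everything here is an illustrative assumption: `toy_model` stands in for a learned model, the sampling distribution is an independent Gaussian per time step, and the hyperparameters are arbitrary.

```python
import random
import statistics


def toy_model(x, u, goal=0.5):
    """Hypothetical stand-in for a learned model: 1D point mass, reward = -distance."""
    x_next = x + 0.1 * u
    return x_next, -abs(x_next - goal)


def rollout_return(model, x, actions, gamma=0.99):
    """Discounted return of one imagined action sequence."""
    G = 0.0
    for t, u in enumerate(actions):
        x, r = model(x, u)
        G += (gamma ** t) * r
    return G


def cem_plan(model, x0, horizon=5, num_samples=100, num_elite=10, iters=5, rng=None):
    rng = rng or random.Random(0)
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(iters):
        # Sample sequences from the current Gaussian, clipped to [-1, 1].
        samples = [
            [max(-1.0, min(1.0, rng.gauss(mean[t], std[t]))) for t in range(horizon)]
            for _ in range(num_samples)
        ]
        # Keep the highest-return sequences and refit the distribution to them.
        samples.sort(key=lambda seq: rollout_return(model, x0, seq), reverse=True)
        elites = samples[:num_elite]
        mean = [statistics.fmean([seq[t] for seq in elites]) for t in range(horizon)]
        std = [statistics.pstdev([seq[t] for seq in elites]) + 1e-3 for t in range(horizon)]
    return mean[0]  # execute only the first action, then replan


first_action = cem_plan(toy_model, x0=0.0)
print(first_action)
```

Compared with random shooting, CEM concentrates later samples around sequences that already scored well, which usually needs fewer total rollouts for the same plan quality. The small constant added to the standard deviation keeps the distribution from collapsing to a point.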
| Planning method | Idea | Cost |
|---|---|---|
| Brute-force rollouts | Sample action sequences | Grows with horizon × samples |
| MPC / CEM | Refine action distribution | Moderate; popular in robotics |
| Tree search | Branch on actions (next lesson) | High for large branching |
| Value + model (Dyna) | Model generates TD targets | Low per step |
Compounding error and model bias
Errors in one-step prediction compound over multi-step rollouts. A 5% per-step error over 50 steps can place imagined states far from any real state the model was trained on — out-of-distribution inputs produce garbage predictions.
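A rough way to see the growth is to assume a simplified worst case in which each step multiplies the accumulated relative error by (1 + ε). This is an illustrative model, not a bound that holds for every system; `compounded_error` is a hypothetical helper.

```python
def compounded_error(per_step_error, steps):
    """Relative error after `steps` if each step multiplies error by (1 + per_step_error)."""
    return (1.0 + per_step_error) ** steps - 1.0


for horizon in (5, 10, 50):
    print(horizon, round(compounded_error(0.05, horizon), 2))
```

Under this assumption a 5% per-step error grows to roughly 28% after 5 steps and 63% after 10, but to more than ten times the original scale after 50, which is why short planning horizons are so much safer.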
Mitigations:
- Use short horizons and replan every step.
- Use ensemble models: disagreement across members signals uncertainty, so the planner can avoid risky imagined paths.
- Penalize uncertainty in the planning objective.
- Mix real and imagined data (Dyna-Q, Dreamer) so the policy does not overfit to fantasy.
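The ensemble and uncertainty-penalty ideas above can be combined in a small sketch. The three-member ensemble, its gains, and the weight `beta` are all illustrative assumptions; disagreement is measured as the standard deviation of the members' predictions.

```python
import statistics


def make_model(gain):
    """Hypothetical ensemble member: 1D point mass with its own learned gain."""
    def model(x, u):
        return x + gain * u
    return model


# Assumed ensemble whose members learned slightly different gains.
ensemble = [make_model(g) for g in (0.08, 0.10, 0.12)]


def penalized_score(x, u, goal, beta):
    """Negative distance of the mean prediction to the goal, minus an uncertainty penalty."""
    preds = [m(x, u) for m in ensemble]
    disagreement = statistics.pstdev(preds)  # spread across members
    return -abs(statistics.fmean(preds) - goal) - beta * disagreement


def plan(x, goal, beta, candidates=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    return max(candidates, key=lambda u: penalized_score(x, u, goal, beta))


mild = plan(0.0, goal=0.05, beta=1.0)       # accepts some uncertainty to reach the goal
cautious = plan(0.0, goal=0.05, beta=10.0)  # prefers the action all members agree on

print(mild, cautious)
```

With a small penalty the planner picks u = 0.5, which the mean model predicts will hit the goal. With a large penalty it picks u = 0, the only action on which all members agree exactly. Choosing `beta` trades task progress against trust in the model.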
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Planning too far ahead with a weak model | Great simulated return, poor real return | Shorter horizon, ensembles, MPC |
| No normalization of states or actions | Model loss diverges | Normalize s and a with training-set statistics; reuse the same statistics at planning time |
| Deterministic model for stochastic env | Overconfident planner | Stochastic / probabilistic dynamics |
| Ignoring reward model error | Wrong trade-offs near goals | Jointly train r and s′; validate on holdout |
| No real data refresh | Policy exploits model holes | Continue collecting real transitions |
Closing
Planning with learned models turns RL into learn + imagine + decide. The model is never perfect; production systems treat it as a sample multiplier and a what-if engine, not a replacement for grounding in real feedback. The next lessons add Dyna-Q (tabular hybrid), MCTS (tree search), and Dreamer (latent world models).