Exploration & intrinsic motivation
Before we begin
ε-greedy and entropy bonuses help, but sparse-reward domains (a maze with the goal in a far corner, hard-exploration games) need structured exploration. Intrinsic motivation adds reward for novelty, prediction error, or empowerment, so the agent can discover useful skills before any task reward arrives.
Extrinsic reward — environment task signal rₜ.
Intrinsic reward — rᵢₜ from curiosity, count bonus, etc.
Exploration bonus — encourages visiting under-explored states.
What you will learn
- Classify exploration: random, optimism, count-based, intrinsic.
- Implement prediction-error curiosity (ICM-style intuition).
- Connect entropy regularization (SAC) to exploration.
- Recognize the noisy-TV problem: stochastic noise looks "novel".
- Choose exploration strategy by reward density and state representation.
Exploration taxonomy
| Method | Mechanism | Best when |
|---|---|---|
| ε-greedy | Random actions | Small discrete, dense reward |
| Boltzmann / softmax | Sample actions with probability ∝ exp(Q/τ) | Discrete actions |
| OU / Gaussian noise | Perturb continuous actions | Deterministic policies (DDPG) |
| Entropy (SAC) | Maximize H(π) | Continuous control |
| Count / pseudo-count | Bonus β/√N(s) | Tabular / learned hash |
| Curiosity (ICM) | Forward-model error ‖f(φ(s), a) − φ(s′)‖² | Sparse reward, visual input |
| RND | Random net feature novelty | Hard exploration games |
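The ε-greedy, Boltzmann, and entropy rows of the table can be made concrete in a few lines. This is a minimal sketch in plain Python rather than a reference implementation; the function names, the example Q-values, and the temperature parameter τ (`temperature`) are illustrative assumptions.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; a higher temperature gives a flatter distribution."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy H(pi) in nats, the quantity an entropy bonus rewards."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

rng = random.Random(0)
q = [0.1, 0.9, 0.3]  # hypothetical Q-values for three actions
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)
sharp = boltzmann_probs(q, temperature=0.1)   # close to greedy
flat = boltzmann_probs(q, temperature=10.0)   # close to uniform
sampled_action = rng.choices(range(len(q)), weights=flat)[0]
print(greedy_action, entropy(sharp), entropy(flat))
```

Raising the temperature moves the Boltzmann policy from nearly greedy toward nearly uniform, which is the same trade-off an entropy bonus controls through its weight.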
Count-based exploration
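In tabular settings, the bonus b(s) = β / √N(s), where N(s) is the number of visits to s, rewards rarely visited states and decays as they become familiar. Pseudo-counts derived from density models (CTS, PixelCNN) extend the idea to high-dimensional observations, though they remain research-heavy.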
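A tabular count bonus fits in a few lines. The sketch below assumes hashable states and counts the current visit before computing the bonus; the class name `CountBonus` and the example state labels are hypothetical.

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular exploration bonus b(s) = beta / sqrt(N(s))."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s), zero for unseen states

    def __call__(self, state):
        # Count this visit first so that N(s) >= 1 and the bonus is finite.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=1.0)
first_visit = bonus("room_A")    # N = 1
second_visit = bonus("room_A")   # N = 2, so the bonus shrinks
other_room = bonus("room_B")     # a new state gets the full bonus again
print(first_visit, second_visit, other_room)
```

The bonus would be added to the extrinsic reward at each step, so β sets how strongly novelty competes with the task signal.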
Checkpoint: Why does a count bonus fail on raw pixels without hashing?
Answer
Almost every frame is unique, so counts never grow and the bonus never decays. Counting needs an abstract state (downsampling, a learned embedding, ICM features) so that semantically similar states share counts.
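The effect is easy to see on toy data. The sketch below is one possible abstraction, assumed for illustration only: average small patches and quantize the result into a few intensity bins. The function `abstract_key`, its parameters, and the 4×4 "frames" are hypothetical; learned embeddings or locality-sensitive hashes serve the same purpose in practice.

```python
from collections import Counter

def abstract_key(frame, block=2, levels=4):
    """Average block x block patches, then quantize 0-255 intensities into `levels` bins."""
    height, width = len(frame), len(frame[0])
    key = []
    for r in range(0, height, block):
        for c in range(0, width, block):
            patch = [frame[i][j]
                     for i in range(r, min(r + block, height))
                     for j in range(c, min(c + block, width))]
            mean = sum(patch) / len(patch)
            key.append(min(int(mean * levels / 256), levels - 1))
    return tuple(key)

# Two frames of the same scene that differ only by small pixel noise.
frame_a = [[10, 12, 200, 201], [11, 13, 199, 202], [10, 10, 10, 10], [12, 12, 12, 12]]
frame_b = [[12, 10, 201, 200], [13, 11, 202, 199], [11, 11, 11, 11], [12, 12, 12, 12]]

raw_counts = Counter(tuple(map(tuple, f)) for f in (frame_a, frame_b))
abstract_counts = Counter(abstract_key(f) for f in (frame_a, frame_b))
print(max(raw_counts.values()), max(abstract_counts.values()))
```

On raw pixels each frame is counted once, so the bonus never decays; after abstraction both frames land in the same bucket and the count grows.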
Curiosity-driven exploration (ICM sketch)
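Train a forward model in feature space, f(φ(s), a) ≈ φ(s′), and use its prediction error ‖φ(s′) − f(φ(s), a)‖² as the intrinsic reward. The agent is drawn to states where the dynamics are still surprising. Ideally that surprise is learnable rather than irreducible noise, which no model can ever predict.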
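```python
import torch

# Intrinsic reward sketch. encoder, forward_model, s, a, s_next, r_extrinsic,
# and beta are assumed to come from the surrounding training loop.
with torch.no_grad():  # the bonus is a reward signal, not a loss term
    phi_s = encoder(s)
    phi_s_next = encoder(s_next)
    pred = forward_model(phi_s, a)  # predicted next-state features
    # Squared error averaged over features: one bonus per transition
    r_intrinsic = (phi_s_next - pred).pow(2).mean(dim=-1)
total_reward = r_extrinsic + beta * r_intrinsic
```

An inverse model can filter uncontrollable noise. It predicts a from (φ(s), φ(s′)), which pushes the encoder to keep only features the agent's actions can influence, so the forward error is measured on controllable features only.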
Random Network Distillation (RND)
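RND fixes a randomly initialized network R(s) and trains a predictor R̂(s) to match it on visited states. The predictor has never been fit on novel states, so their error ‖R(s) − R̂(s)‖² is high and serves as the bonus. This is simpler than learning a full dynamics model, and it performed well on Montezuma's Revenge.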
| Pros | Cons |
|---|---|
| Easy to implement | Can chase stochastic noise |
| Scales with CNN features | Needs observation and reward normalization |
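The mechanism can be sketched without a deep-learning library. The toy version below makes several simplifying assumptions that are not part of RND itself: the target is a single random `tanh` layer, the predictor is linear, states are one-hot vectors, and no normalization is applied. The class name `RND` and all hyperparameters are illustrative.

```python
import math
import random

class RND:
    def __init__(self, obs_dim, feat_dim, lr=0.1, seed=0):
        rng = random.Random(seed)
        # Fixed random target R(s) = tanh(W s); it is never trained.
        self.target_w = [[rng.gauss(0.0, 1.0) for _ in range(obs_dim)]
                         for _ in range(feat_dim)]
        # Predictor R_hat(s) = V s, trained to imitate the target on visited states.
        self.pred_w = [[0.0] * obs_dim for _ in range(feat_dim)]
        self.lr = lr

    def _target(self, s):
        return [math.tanh(sum(w * x for w, x in zip(row, s))) for row in self.target_w]

    def _predict(self, s):
        return [sum(w * x for w, x in zip(row, s)) for row in self.pred_w]

    def bonus(self, s):
        """Squared prediction error ||R(s) - R_hat(s)||^2, used as the novelty bonus."""
        return sum((t - p) ** 2 for t, p in zip(self._target(s), self._predict(s)))

    def update(self, s):
        """One SGD step on the squared error for a visited state."""
        errors = [p - t for t, p in zip(self._target(s), self._predict(s))]
        for row, e in zip(self.pred_w, errors):
            for j, x in enumerate(s):
                row[j] -= self.lr * 2.0 * e * x

rnd = RND(obs_dim=4, feat_dim=8)
familiar = [1.0, 0.0, 0.0, 0.0]
novel = [0.0, 0.0, 0.0, 1.0]
bonus_before = rnd.bonus(familiar)
for _ in range(200):          # the agent keeps visiting the same state
    rnd.update(familiar)
bonus_after = rnd.bonus(familiar)
bonus_novel = rnd.bonus(novel)
print(bonus_before, bonus_after, bonus_novel)
```

The bonus for the repeatedly visited state shrinks toward zero while the unvisited state keeps its initial error, which is the novelty signal RND relies on.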
Noisy TV and exploration traps
A TV showing random frames is irreducible noise: its prediction error never decays, so a curiosity-driven agent stares at the screen instead of exploring the maze. Mitigations include ensemble disagreement, inverse models, reward normalization, and episodic memory (novelty counted within the current episode).
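A toy experiment shows why ensemble disagreement helps. In the sketch below each "model" is just a running-mean predictor trained on a random half of the data, a deliberately crude stand-in for an ensemble of neural forward models; the function name and the two transition samplers (a deterministic "door" and a uniform-noise "TV") are assumptions made for illustration.

```python
import random
import statistics

def error_and_disagreement(sample_next, n_steps=2000, n_models=5, seed=0):
    """Return (prediction error of one model, variance across ensemble predictions)."""
    rng = random.Random(seed)
    means = [0.0] * n_models   # each model predicts the running mean of what it saw
    seen = [0] * n_models
    for _ in range(n_steps):
        target = sample_next(rng)
        for k in range(n_models):
            if rng.random() < 0.5:   # each model trains on a random half of the data
                seen[k] += 1
                means[k] += (target - means[k]) / seen[k]
    fresh = [sample_next(rng) for _ in range(500)]
    error = statistics.fmean((t - means[0]) ** 2 for t in fresh)
    disagreement = statistics.pvariance(means)
    return error, disagreement

# Deterministic transition: the next observation is always the same.
door_error, door_disagreement = error_and_disagreement(lambda rng: 1.0)
# Noisy TV: the next observation is pure noise.
tv_error, tv_disagreement = error_and_disagreement(lambda rng: rng.random())
print(door_error, door_disagreement)
print(tv_error, tv_disagreement)
```

Prediction error on the TV stays near the variance of the noise however long training runs, whereas the ensemble members converge to the same prediction, so a disagreement-based bonus fades for both the door and the TV.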
Exploration vs exploitation in production
| Research | Production |
|---|---|
| Maximize novelty | Constrain to safe actions |
| Long horizons | Limited regret per user |
| Sim resets | Real cost per trial |
Use shadow policies, contextual bandits, or small traffic slices for exploration. Offline logs from the logging policy provide coverage (see Lesson 1).
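One way to carve out a small traffic slice is to hash a stable user identifier, so the same user always gets the same treatment. This is a sketch under assumptions: the function name, the salt string, and the `user-…` identifiers are hypothetical, and a real system would also restrict the exploring policy to a vetted set of safe actions.

```python
import hashlib

def in_exploration_slice(user_id: str, fraction: float, salt: str = "explore-v1") -> bool:
    """Deterministically assign roughly `fraction` of users to the exploration slice."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform integer in [0, 2**64)
    return bucket < int(fraction * 2**64)

# Route about 5% of users to the exploring policy; everyone else gets the default.
share = sum(in_exploration_slice(f"user-{i}", 0.05) for i in range(10_000)) / 10_000
print(share)
```

Changing the salt reshuffles which users fall into the slice, which helps avoid exposing the same users to every experiment.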
Worked example: sparse grid goal
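Extrinsic: +1 at goal, 0 elsewhere. ε-greedy receives no learning signal until it first reaches the goal by chance, so it behaves like a random walk and the expected time to the first success grows quickly with grid size. A count bonus, which decays as 1/√N(s), steers the agent toward unvisited cells and typically finds the goal much sooner in tabular regimes. ICM helps when observations are images of the grid.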
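The contrast can be simulated in a few lines. The sketch below is a simplification, not a full RL agent: the "count" policy simply steps to the least-visited neighbor, standing in for a count bonus combined with a learned value function, and the "random" policy is a uniform random walk, standing in for ε-greedy before any reward has been seen. The function name and grid sizes are assumptions.

```python
import random

def steps_to_goal(height, width, policy, rng, max_steps=100_000):
    """Steps until the agent first reaches the far corner, or None if it never does."""
    goal = (height - 1, width - 1)
    pos = (0, 0)
    counts = {pos: 1}
    for step in range(1, max_steps + 1):
        r, c = pos
        neighbors = [(r + dr, c + dc)
                     for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0))
                     if 0 <= r + dr < height and 0 <= c + dc < width]
        if policy == "random":
            pos = rng.choice(neighbors)
        else:  # "count": move to a least-visited neighbor, ties broken at random
            fewest = min(counts.get(n, 0) for n in neighbors)
            pos = rng.choice([n for n in neighbors if counts.get(n, 0) == fewest])
        counts[pos] = counts.get(pos, 0) + 1
        if pos == goal:
            return step
    return None

corridor_count = steps_to_goal(1, 10, "count", random.Random(0))
corridor_random = steps_to_goal(1, 10, "random", random.Random(0))
grid_count = steps_to_goal(8, 8, "count", random.Random(0))
grid_random = steps_to_goal(8, 8, "random", random.Random(0))
print(corridor_count, corridor_random, grid_count, grid_random)
```

In the 1×10 corridor the count-driven agent never turns back, because the cell ahead is always unvisited, while the random walk wanders in both directions before it reaches the end.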
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Intrinsic weight β too large | Ignores task reward | Anneal β |
| Curiosity on noise | Agent frozen on TV | RND norm, ensemble |
| No episodic reset bonus | Same room re-novelty | Episodic counts |
| Exploration in unsafe sim | Dangerous real deploy | Safe sim + constraints |
| Same ε at train and test | Evaluation understates the policy | Act greedily (or use the policy mean) at eval |
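Two of the fixes in the table, annealing β and acting greedily at evaluation, are small enough to sketch directly. The linear schedule, the default values, and the function names below are assumptions chosen for illustration; other schedules may suit a given task better.

```python
import random

def annealed_beta(step, beta_start=1.0, beta_end=0.0, anneal_steps=100_000):
    """Linearly move the intrinsic weight from beta_start to beta_end."""
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return beta_start + fraction * (beta_end - beta_start)

def total_reward(r_extrinsic, r_intrinsic, step):
    """Task reward plus the intrinsic bonus, weighted by the annealed beta."""
    return r_extrinsic + annealed_beta(step) * r_intrinsic

def act(q_values, epsilon, rng, evaluate=False):
    """Epsilon-greedy during training; always greedy when evaluate=True."""
    if not evaluate and rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
eval_action = act([0.2, 0.7], epsilon=1.0, rng=rng, evaluate=True)
train_action = act([0.2, 0.7], epsilon=1.0, rng=rng)
print(annealed_beta(0), annealed_beta(50_000), annealed_beta(200_000))
```

Once β reaches zero the agent optimizes the task reward alone, so the schedule length decides how long novelty is allowed to compete with the task.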
Closing
Exploration is not one trick — match state representation and reward sparsity to the bonus. Intrinsic motivation bootstraps learning when extrinsic signal is rare; combine with offline data and safety filters before production-facing exploration.