Module 9 — Production & advanced topics

Exploration & intrinsic motivation

Count-based, curiosity, RND, and sparse-reward environments.

~55 min read + exercises

Before we begin

ε-greedy and entropy bonuses help, but sparse-reward domains (a maze with the goal in a far corner, hard-exploration games) need structured exploration. Intrinsic motivation adds reward for novelty, prediction error, or empowerment, so the agent can discover useful skills before any task reward arrives.

Extrinsic reward — environment task signal rₜ.
Intrinsic reward — rᵢₜ from curiosity, count bonus, etc.
Exploration bonus — encourages visiting under-explored states.


What you will learn

  • Classify exploration: random, optimism, count-based, intrinsic.
  • Implement prediction-error curiosity (ICM-style intuition).
  • Connect entropy regularization (SAC) to exploration.
  • Recognize the noisy TV problem: irreducible stochastic noise looks "novel".
  • Choose exploration strategy by reward density and state representation.

Exploration taxonomy

| Method | Mechanism | Best when |
|---|---|---|
| ε-greedy | Random actions | Small discrete action space, dense reward |
| Boltzmann / softmax | Sample actions from a softmax over Q | Discrete actions |
| OU / Gaussian noise | Perturb continuous actions with noise | DDPG |
| Entropy (SAC) | Maximize H(π) | Continuous control |
| Count / pseudo-count | Bonus 1/√N(s) | Tabular / learned hash |
| Curiosity (ICM) | ‖f(s,a) − s′‖ error | Sparse reward, visual input |
| RND | Random-network feature novelty | Hard exploration games |
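
The first two rows and the entropy row can be illustrated with a minimal sketch. The function names and the example Q-values below are assumptions made for illustration, not part of any library. The sketch shows ε-greedy selection, Boltzmann probabilities, and how temperature changes the policy entropy H(π) that entropy-regularized methods reward.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; a higher temperature gives a flatter distribution."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy H(pi) in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

rng = random.Random(0)
q = [1.0, 2.0, 0.5]                      # hypothetical Q-values for three actions
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)
cold = boltzmann_probs(q, temperature=0.1)   # nearly greedy
hot = boltzmann_probs(q, temperature=10.0)   # nearly uniform
```

A low temperature concentrates probability on the best action and gives low entropy; a high temperature spreads it out. Both methods explore without any notion of which states are novel, which is why they struggle when reward is sparse.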

Count-based exploration

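In tabular settings, a bonus b(s) = β / √N(s), where N(s) is the visit count of state s, rewards rarely visited states and decays as they become familiar. Pseudo-counts derived from density models (CTS, PixelCNN) extend the idea to high-dimensional observations, though these methods are research-heavy.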
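
As a minimal sketch of the tabular bonus, assuming states are hashable (here grid coordinates) and that the count is incremented before the bonus is computed; the class name `CountBonus` is hypothetical:

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular exploration bonus b(s) = beta / sqrt(N(s))."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s), zero for unseen states

    def __call__(self, state):
        # Count this visit first, so the first visit yields beta / sqrt(1).
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=1.0)
first = bonus((0, 0))                          # first visit to this cell
fourth = [bonus((0, 0)) for _ in range(3)][-1]  # fourth visit to the same cell
```

The bonus halves between the first and fourth visit, so a state has to be visited four times as often for its bonus to halve again.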

Checkpoint: Why does count bonus fail on raw pixels without hashing?

Answer

Almost every frame is unique, so each count stays near 1 and the bonus never decays. Counts need an abstract state (downsampled frames, a learned embedding, ICM features) so that semantically similar states share a count.
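
One simple way to build such an abstract state is to downsample and quantize each frame before counting. The sketch below is an illustration under assumptions: frames are nested lists of 0 to 255 intensities, and `frame_key`, `block`, and `levels` are hypothetical names and parameters, not a published method.

```python
from collections import defaultdict

def frame_key(frame, block=2, levels=4):
    """Average-pool block x block patches, then quantize into `levels` bins."""
    h, w = len(frame), len(frame[0])
    key = []
    for i in range(0, h - h % block, block):
        for j in range(0, w - w % block, block):
            patch = [frame[i + di][j + dj]
                     for di in range(block) for dj in range(block)]
            mean = sum(patch) / len(patch)
            key.append(min(levels - 1, int(mean * levels / 256)))
    return tuple(key)

counts = defaultdict(int)
frame_a = [[10, 12, 200, 205], [11, 9, 198, 210]]
frame_b = [[12, 10, 203, 199], [8, 13, 207, 201]]  # same scene, small pixel noise
for frame in (frame_a, frame_b):
    counts[frame_key(frame)] += 1
```

The two raw frames differ pixel by pixel, yet they map to the same key and share one count. How coarse the key should be is a design choice: too fine and counts never grow, too coarse and distinct states collapse together.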


Curiosity-driven exploration (ICM sketch)

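Train a forward model f(φ(s), a) ≈ φ(s′) in feature space and use its prediction error ‖φ(s′) − f(φ(s), a)‖² as the intrinsic reward. The agent is drawn to states where the dynamics are still surprising. Ideally that surprise reflects learnable structure rather than irreducible noise, which is why the choice of features φ matters.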

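python
# Intrinsic reward sketch; encoder and forward_model are assumed to be
# already-defined PyTorch modules, and tensors carry a leading batch dimension.
import torch

with torch.no_grad():  # the bonus is a reward signal, not a loss to backpropagate
    phi_s = encoder(s)
    phi_s_next = encoder(s_next)
    pred = forward_model(phi_s, a)
    # Squared error averaged over feature dimensions: one bonus per transition
    r_intrinsic = (phi_s_next - pred).pow(2).mean(dim=-1)
total_reward = r_extrinsic + beta * r_intrinsic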

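An inverse model can filter out uncontrollable noise: it predicts a from (φ(s), φ(s′)), which trains φ to keep only the features the agent's actions influence. The forward error is then measured on those controllable features only.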
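
Written out as losses, the two models can be combined roughly as follows. This is a sketch under assumptions: g denotes the inverse model, CE is cross-entropy over discrete actions, and w is a mixing weight introduced here (a different symbol from the intrinsic-reward weight β used above).

```latex
\begin{aligned}
L_I &= \mathrm{CE}\big(g(\phi(s), \phi(s')),\, a\big)
      && \text{(inverse model; its gradient shapes } \phi\text{)} \\
L_F &= \tfrac{1}{2}\,\lVert f(\phi(s), a) - \phi(s') \rVert^2
      && \text{(forward model; also the source of the intrinsic reward)} \\
L   &= (1 - w)\, L_I + w\, L_F, \qquad 0 \le w \le 1
\end{aligned}
```

Because φ is shaped by the inverse loss, features that do not help predict the action, such as background flicker, carry little weight in the forward error.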


Random Network Distillation (RND)

Keep a fixed, randomly initialized network R(s) and train a predictor R̂(s) to match it on visited states. Novel states have a high error ‖R(s) − R̂(s)‖, which serves as the bonus. This is simpler than learning a full dynamics model, and it worked on the Montezuma's Revenge benchmark.
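
The mechanism can be shown with a toy sketch. It assumes linear networks and hand-picked 3-dimensional states so that it runs with the standard library only; real RND uses deep nonlinear networks on observations, and all names below are hypothetical.

```python
import random

def make_linear(in_dim, out_dim, rng, scale=1.0):
    """Random weight matrix with out_dim rows and in_dim columns."""
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def apply(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def rnd_bonus(target, predictor, x):
    """Squared error between the frozen target and the trained predictor."""
    t, p = apply(target, x), apply(predictor, x)
    return sum((ti - pi) ** 2 for ti, pi in zip(t, p))

def train_step(target, predictor, x, lr=0.1):
    """One SGD step on the predictor only; the target is never updated."""
    t, p = apply(target, x), apply(predictor, x)
    for k, row in enumerate(predictor):
        err = p[k] - t[k]
        for j in range(len(row)):
            row[j] -= lr * 2.0 * err * x[j]

rng = random.Random(0)
target = make_linear(3, 4, rng)           # fixed random network R
predictor = make_linear(3, 4, rng, 0.1)   # trainable network R_hat

visited = [1.0, 0.0, 0.0]
novel = [0.0, 1.0, 0.0]
before = rnd_bonus(target, predictor, visited)
for _ in range(200):
    train_step(target, predictor, visited)
after = rnd_bonus(target, predictor, visited)
novel_bonus = rnd_bonus(target, predictor, novel)
```

After training, the bonus on the visited state is close to zero while the bonus on the unseen state is unchanged, since the predictor received no gradient in that direction.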

| Pros | Cons |
|---|---|
| Easy to implement | Can chase stochastic noise |
| Scales with CNN features | Needs normalization |
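
On the normalization point: raw prediction errors change scale as the predictor trains, so a common remedy is to divide the bonus by a running estimate of its standard deviation. The sketch below uses Welford's online algorithm; the class and function names are hypothetical, and which quantity to normalize (per-step bonus or intrinsic return) is left as an assumption.

```python
import math

class RunningStd:
    """Welford's online algorithm for the mean and population variance."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are at least two samples.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0

def normalize(bonus, stats, eps=1e-8):
    """Update the running statistics, then rescale the bonus."""
    stats.update(bonus)
    return bonus / (stats.std + eps)

stats = RunningStd()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

With the bonus on a roughly constant scale, a single weight β keeps its meaning throughout training.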

Noisy TV and exploration traps

A random TV (a source of irreducible noise) yields perpetual prediction error, so the agent stares at the noise instead of exploring the maze. Mitigations include ensemble disagreement, inverse models, reward normalization, and episodic memory (per-episode counts rather than a global novelty signal).
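
A toy simulation shows why ensemble disagreement helps. The setup is an assumption made for illustration: each "observation" is a single number, each ensemble member predicts the running mean of the samples it has seen, and members see random halves of the data. Prediction error on pure noise never decays, while disagreement between members shrinks as they all converge to the same mean.

```python
import random
import statistics

class MeanPredictor:
    """Predicts the next observation as the running mean of what it has seen."""

    def __init__(self, init):
        self.mean, self.n = init, 0

    def update(self, target):
        self.n += 1
        self.mean += (target - self.mean) / self.n

def run(source, rng, steps=2000, members=5):
    ensemble = [MeanPredictor(rng.uniform(-1, 1)) for _ in range(members)]
    errors, disagreements = [], []
    for _ in range(steps):
        target = source(rng)
        preds = [m.mean for m in ensemble]
        errors.append((target - preds[0]) ** 2)             # curiosity-style bonus
        disagreements.append(statistics.pvariance(preds))   # disagreement bonus
        for m in ensemble:
            if rng.random() < 0.5:  # bootstrap: each member sees a random half
                m.update(target)
    tail = steps // 10  # average over the last 10% of steps
    return statistics.mean(errors[-tail:]), statistics.mean(disagreements[-tail:])

rng = random.Random(0)
tv_error, tv_disagreement = run(lambda r: r.gauss(0.0, 1.0), rng)  # noisy TV
det_error, det_disagreement = run(lambda r: 1.0, rng)              # learnable source
```

On the noisy source the error bonus stays near the noise variance, but the disagreement bonus becomes small, so an agent rewarded for disagreement loses interest in the TV.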


Exploration vs exploitation in production

| Research | Production |
|---|---|
| Maximize novelty | Constrain to safe actions |
| Long horizons | Limited regret per user |
| Sim resets | Real cost per trial |

Use shadow policies, contextual bandits, or small traffic slices for exploration. Offline logs collected by the logging policy provide coverage; this ties back to Lesson 1.
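
A small traffic slice with a safety allow-list might look like the sketch below. Everything in it is hypothetical (the function name, the action names, the 5% slice); it is one possible arrangement, not a prescribed design. It also returns the probability of the chosen action, on the assumption that logged propensities are wanted for later offline evaluation.

```python
import random

def choose_action(scores, safe_actions, explore_fraction, rng):
    """Return (action, propensity) for one request.

    scores: dict mapping action -> value estimated by the production policy.
    safe_actions: allow-list; both exploitation and exploration stay inside it.
    explore_fraction: share of traffic that receives a uniformly random safe action.
    """
    greedy = max(safe_actions, key=lambda a: scores[a])
    if rng.random() < explore_fraction:
        action = rng.choice(sorted(safe_actions))
    else:
        action = greedy
    # Probability that this policy assigns to the chosen action.
    propensity = explore_fraction / len(safe_actions)
    if action == greedy:
        propensity += 1.0 - explore_fraction
    return action, propensity

rng = random.Random(0)
scores = {"keep_default": 0.2, "new_layout": 0.9, "risky_variant": 1.5}
safe = {"keep_default", "new_layout"}  # "risky_variant" is never served
log = [choose_action(scores, safe, 0.05, rng) for _ in range(1000)]
```

The highest-scoring action is excluded because it is not on the allow-list, which is the production trade-off in the table above: bounded regret and safety take priority over novelty.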


Worked example: sparse grid goal

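Extrinsic reward: +1 at the goal, 0 elsewhere. Until the first reward arrives, ε-greedy has no signal to follow and wanders like a random walk, so the time to reach the goal grows quickly with grid size. A count bonus keeps steering the agent toward rarely visited cells, so in tabular regimes it tends to cover the grid and reach the goal far sooner. ICM helps when observations are images of the grid rather than coordinates.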
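
The comparison can be tried on a small grid. The sketch rests on several simplifying assumptions: an 8×8 grid, a uniform random walk standing in for ε-greedy before any reward has been seen, and a count-based agent that uses the known grid dynamics to move toward the successor with the lowest visit count (equivalently, the largest β / √N(s) bonus). A learned Q-function with a bonus would behave differently in detail.

```python
import random

SIZE = 8
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(state, move):
    """Deterministic grid dynamics; moving into a wall leaves the state unchanged."""
    r, c = state[0] + move[0], state[1] + move[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

def steps_to_goal(policy, rng, budget=10_000):
    """Steps until the first extrinsic reward (+1 at GOAL), capped at budget."""
    state, counts = START, {START: 1}
    for t in range(1, budget + 1):
        state = step(state, policy(state, counts, rng))
        counts[state] = counts.get(state, 0) + 1
        if state == GOAL:
            return t
    return budget

def random_policy(state, counts, rng):
    return rng.choice(MOVES)

def count_policy(state, counts, rng):
    """Move to the successor with the lowest visit count, ties broken at random."""
    visits = [counts.get(step(state, m), 0) for m in MOVES]
    best = min(visits)
    return rng.choice([m for m, v in zip(MOVES, visits) if v == best])

def average(policy, trials=200):
    """Mean steps to goal over independent seeded trials."""
    total = sum(steps_to_goal(policy, random.Random(seed)) for seed in range(trials))
    return total / trials

avg_random = average(random_policy)
avg_count = average(count_policy)
```

Comparing `avg_random` with `avg_count` for a few values of `SIZE` gives a feel for how the gap widens as the grid grows.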


Common mistakes

| Mistake | Symptom | Fix |
|---|---|---|
| Intrinsic weight β too large | Agent ignores task reward | Anneal β |
| Curiosity on noise | Agent frozen on TV | RND norm, ensemble |
| No episodic reset bonus | Same room re-novelty | Episodic counts |
| Exploration in unsafe sim | Dangerous real deploy | Safe sim + constraints |
| Same ε train/test | Wrong eval | Greedy / mean at eval |
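
The first fix in the table, annealing β, can be sketched as a linear schedule. The schedule shape, the default values, and the function names are assumptions for illustration; other decay shapes work too.

```python
def annealed_beta(step, beta_start=1.0, beta_end=0.0, anneal_steps=100_000):
    """Linearly interpolate beta from beta_start to beta_end over anneal_steps."""
    frac = min(1.0, max(0.0, step / anneal_steps))
    return beta_start + frac * (beta_end - beta_start)

def training_reward(r_extrinsic, r_intrinsic, step):
    """Reward used for training; the intrinsic term fades as training proceeds."""
    return r_extrinsic + annealed_beta(step) * r_intrinsic

early = training_reward(0.0, 2.0, step=0)        # exploration bonus dominates
late = training_reward(1.0, 2.0, step=100_000)   # only task reward remains
```

At evaluation time the intrinsic term is dropped entirely and actions are chosen greedily (or as the policy mean), in line with the last row of the table.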

Closing

Exploration is not one trick — match state representation and reward sparsity to the bonus. Intrinsic motivation bootstraps learning when extrinsic signal is rare; combine with offline data and safety filters before production-facing exploration.

