Exploration & intrinsic motivation

Before we begin

ε-greedy and entropy bonuses help, but sparse-reward domains (a maze with the goal in a far corner, hard-exploration games) need structured exploration. Intrinsic motivation adds reward for novelty, prediction error, or empowerment so the agent discovers skills before task reward arrives.

Extrinsic reward — environment task signal rₜ.
Intrinsic reward — rᵢₜ from curiosity, count bonus, etc.
Exploration bonus — encourages visiting under-explored states.

What you will learn

Classify exploration: random, optimism, count-based, intrinsic.
Implement prediction-error curiosity (ICM-style intuition).
Connect entropy regularization (SAC) to exploration.
Recognize noisy TV problem — stochastic noise looks "novel".
Choose exploration strategy by reward density and state representation.

Exploration taxonomy

Method	Mechanism	Best when
ε-greedy	Random actions	Small discrete, dense reward
Boltzmann / softmax	Sample a with probability ∝ exp(Q(s,a)/τ)	Discrete actions
OU / Gaussian noise	Add noise to continuous actions	Deterministic policies (DDPG)
Entropy (SAC)	Maximize H(π)	Continuous control
Count / pseudo-count	Bonus 1/√N(s)	Tabular / learned hash
Curiosity (ICM)	Forward-model error ‖f(φ(s), a) − φ(s′)‖²	Sparse, visual
RND	Random net feature novelty	Hard exploration games
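
The first two rows of the table, plus the entropy quantity that SAC maximizes, can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the temperature parameter are illustrative choices, not part of any specific RL library.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; subtracting the max keeps exp() numerically stable."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

def entropy(probs):
    """Shannon entropy H(pi) in nats, the quantity an entropy bonus rewards."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

q = [1.0, 2.0, 3.0]
hot = entropy(boltzmann_probs(q, temperature=10.0))   # near-uniform policy
cold = entropy(boltzmann_probs(q, temperature=0.1))   # near-greedy policy
```

A high temperature spreads probability across actions (high entropy, more exploration); a low temperature approaches greedy selection.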

Count-based exploration

In tabular settings, bonus b(s) = β / √N(s) encourages rare states. Pseudo-counts from density models extend to high dimensions (CTS, PixelCNN — research-heavy).
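
A tabular version of the bonus b(s) = β / √N(s) fits in a small class. This is a sketch that assumes states are hashable (for example grid coordinates); the class name `CountBonus` is hypothetical.

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular exploration bonus b(s) = beta / sqrt(N(s))."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s), keyed by a hashable state

    def __call__(self, state):
        # Count the visit first so N(s) >= 1 and the division is always defined.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=1.0)
first = bonus((0, 0))    # N = 1
second = bonus((0, 0))   # N = 2, so the bonus shrinks
fresh = bonus((3, 4))    # a new state gets the full bonus again
```

Incrementing before dividing is one way to avoid the undefined N(s) = 0 case; adding a small constant to the count is another.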

Checkpoint: Why does count bonus fail on raw pixels without hashing?

Answer

Almost every frame is unique — counts never grow, bonus never decays. Need abstract state (downsample, learned embedding, ICM features) so semantically similar states share counts.
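
One crude way to build such an abstract state is to downsample and quantize each frame before counting. The sketch below assumes grayscale frames stored as nested lists of 0-255 integers; `frame_key`, `block`, and `levels` are hypothetical names, and a learned embedding would normally replace this hand-built key.

```python
from collections import Counter

def frame_key(frame, block=4, levels=8):
    """Map a grayscale frame (rows of 0-255 ints) to a coarse hashable key.

    Averages over block x block patches, then quantizes each average to
    `levels` bins, so frames that differ by small pixel changes tend to
    share a key.
    """
    height, width = len(frame), len(frame[0])
    key = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            patch = [frame[y][x]
                     for y in range(top, min(top + block, height))
                     for x in range(left, min(left + block, width))]
            mean = sum(patch) / len(patch)
            key.append(int(mean * levels / 256))
    return tuple(key)

counts = Counter()
frame_a = [[100] * 8 for _ in range(8)]
frame_b = [row[:] for row in frame_a]
frame_b[0][0] = 103                        # small pixel-level difference
frame_c = [[200] * 8 for _ in range(8)]    # a genuinely different frame

for frame in (frame_a, frame_b, frame_c):
    counts[frame_key(frame)] += 1
```

Here `frame_a` and `frame_b` share one count while `frame_c` gets its own. Quantization boundaries can still split near-identical frames, which is one reason learned features are preferred in practice.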

Curiosity-driven exploration (ICM sketch)

Train a forward model f(φ(s), a) ≈ φ(s′) in a learned feature space φ; the intrinsic reward is the prediction error ‖φ(s′) − f(φ(s), a)‖². The agent is drawn to states where the dynamics are still surprising. Ideally that surprise is learnable structure, not irreducible noise.

python

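# Intrinsic reward sketch (PyTorch-style; encoder and forward_model are assumed modules)
phi_s = encoder(s)                  # φ(s), shape [batch, feat_dim]
phi_s_next = encoder(s_next)        # φ(s′)
pred = forward_model(phi_s, a)      # f(φ(s), a)
# One squared error per transition; detach so the reward carries no gradient
r_intrinsic = (phi_s_next - pred).pow(2).mean(dim=-1).detach()
total_reward = r_extrinsic + beta * r_intrinsic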

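An inverse model can filter out uncontrollable noise: it predicts a from (φ(s), φ(s′)), and training φ on that objective keeps mainly the features the agent's actions influence. Forward error is then measured in that feature space, so only error on controllable features counts.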

Random Network Distillation (RND)

Keep a fixed, randomly initialized network R(s) and train a predictor R̂(s) to match R on visited states. Novel states have high error ‖R(s) − R̂(s)‖², which serves as the bonus. This is simpler than a full dynamics model and worked on Montezuma's Revenge benchmarks.

Pros	Cons
Easy to implement	Can chase stochastic noise
Scales with CNN features	Needs normalization
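
The RND mechanism can be illustrated without a deep-learning library. The sketch below makes a strong simplifying assumption: single linear maps stand in for both the random target and the predictor, and the "agent" only visits states whose last two coordinates are zero. All names (`target_w`, `pred_w`, `novelty`, `train_step`) are hypothetical.

```python
import random

rng = random.Random(0)
DIM, FEATS = 4, 8

# Fixed, randomly initialized target R: never trained.
target_w = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(FEATS)]
# Trainable predictor R_hat, started at zero.
pred_w = [[0.0] * DIM for _ in range(FEATS)]

def forward(weights, state):
    """Linear map standing in for a network forward pass."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def novelty(state):
    """RND bonus: mean squared error between predictor and fixed target."""
    t, p = forward(target_w, state), forward(pred_w, state)
    return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / FEATS

def train_step(state, lr=0.1):
    """One SGD step on a visited state (constant gradient factors folded into lr)."""
    t, p = forward(target_w, state), forward(pred_w, state)
    for i in range(FEATS):
        err = p[i] - t[i]
        for j in range(DIM):
            pred_w[i][j] -= lr * err * state[j]

# The agent only ever visits states whose last two coordinates are zero.
for _ in range(2000):
    train_step([rng.uniform(-1, 1), rng.uniform(-1, 1), 0.0, 0.0])

familiar = novelty([0.5, -0.3, 0.0, 0.0])   # inside the visited region
novel = novelty([0.5, -0.3, 1.0, 0.0])      # moves along an unvisited direction
```

After training, `familiar` is close to zero while `novel` stays large, because the predictor never received gradient along the unvisited coordinate. Real RND uses nonlinear networks and normalizes both observations and the bonus.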

Noisy TV and exploration traps

A random TV (irreducible noise) yields perpetual prediction error, so the agent stares at the noise instead of exploring the maze. Mitigations: ensemble disagreement, inverse models, reward normalization, episodic memory (novelty counted within the episode, not globally).
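
A toy comparison shows why ensemble disagreement helps. The sketch assumes a hypothetical one-pixel "TV" that emits uniform noise and an ensemble of scalar predictors that all train on the same samples; real ensembles use networks trained on different minibatches, so treat this only as an illustration of the two signals.

```python
import random

rng = random.Random(0)

def tv_pixel():
    """Hypothetical noisy TV: the next observation is pure noise."""
    return rng.uniform(0.0, 1.0)

# Ensemble of scalar predictors with different initial guesses.
ensemble = [rng.uniform(-1.0, 2.0) for _ in range(5)]
lr = 0.05
errors, disagreements = [], []

for _ in range(2000):
    y = tv_pixel()
    mean_pred = sum(ensemble) / len(ensemble)
    # Curiosity-style bonus: squared prediction error of the ensemble mean.
    errors.append((y - mean_pred) ** 2)
    # Disagreement bonus: variance of the members' predictions.
    disagreements.append(
        sum((p - mean_pred) ** 2 for p in ensemble) / len(ensemble))
    # Every member takes an SGD step toward the same noisy target.
    ensemble = [p + lr * (y - p) for p in ensemble]

late_error = sum(errors[-500:]) / 500
late_disagreement = sum(disagreements[-500:]) / 500
```

The prediction error never falls below the noise variance, so an error-based bonus keeps paying out. The members' predictions converge to each other, so a disagreement-based bonus fades once nothing more can be learned.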

Exploration vs exploitation in production

Research	Production
Maximize novelty	Constrain to safe actions
Long horizons	Limited regret per user
Sim resets	Real cost per trial

Use shadow policies, contextual bandits, or small traffic slices for exploration. Offline logs collected by the logging policy also provide coverage (see Lesson 1).
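
One common way to carve out a small traffic slice is deterministic hashing of a user identifier, so each user consistently lands in or out of the exploration group. The sketch below is an assumption about how such a slice might be built; `in_exploration_slice`, `choose_action`, and the salt value are hypothetical names, not a specific platform's API.

```python
import hashlib

def in_exploration_slice(user_id: str, fraction: float,
                         salt: str = "explore-v1") -> bool:
    """Deterministically assign a stable `fraction` of users to the slice."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < fraction * 2**64

def choose_action(user_id, greedy_action, explore_action, fraction=0.05):
    """Serve the exploratory action only inside the slice."""
    if in_exploration_slice(user_id, fraction):
        return explore_action
    return greedy_action

exposed = sum(in_exploration_slice(f"user-{i}", 0.05) for i in range(10_000))
```

Changing the salt reshuffles which users are exposed, which helps spread exploration cost across users between experiments.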

Worked example: sparse grid goal

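Extrinsic: +1 at goal, 0 elsewhere. Until the first reward is seen, ε-greedy on uninformed Q-values is close to a random walk, so the number of steps needed to stumble on the goal grows quickly with grid size. A tabular count bonus β / √N(s) pushes the agent toward less-visited cells and typically reaches the goal much sooner. ICM helps when observations are images of the grid.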
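
The grid example can be simulated directly. The sketch below compares a uniform random walk (what ε-greedy reduces to before any reward is seen, assuming all-zero Q-values) with a walk that always moves to the least-visited neighbouring cell (greedy on the count bonus). It assumes the agent can look up where each move leads; the function names and the 10 x 10 size are illustrative.

```python
import random

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def step(pos, move, size):
    """Move on a size x size grid; bumping into a wall leaves the agent in place."""
    x, y = pos[0] + move[0], pos[1] + move[1]
    return (x, y) if 0 <= x < size and 0 <= y < size else pos

def steps_to_goal(size, use_counts, rng, max_steps=100_000):
    """Steps until the first visit to the far corner, starting from (0, 0)."""
    pos, goal = (0, 0), (size - 1, size - 1)
    counts = {pos: 1}
    for t in range(1, max_steps + 1):
        if use_counts:
            # Greedy on beta / sqrt(N(s')): pick the least-visited successor,
            # breaking ties at random.
            least = min(counts.get(step(pos, m, size), 0) for m in MOVES)
            options = [m for m in MOVES
                       if counts.get(step(pos, m, size), 0) == least]
            move = rng.choice(options)
        else:
            # No reward seen yet: behave as a uniform random walk.
            move = rng.choice(MOVES)
        pos = step(pos, move, size)
        counts[pos] = counts.get(pos, 0) + 1
        if pos == goal:
            return t
    return max_steps  # goal not reached within the step budget

rng = random.Random(0)
trials = 20
random_walk = sum(steps_to_goal(10, False, rng) for _ in range(trials)) / trials
count_guided = sum(steps_to_goal(10, True, rng) for _ in range(trials)) / trials
print(f"random walk: {random_walk:.0f} steps, count-guided: {count_guided:.0f} steps")
```

Exact numbers depend on the seed and grid size, so compare the two averages over several seeds before drawing conclusions.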

Common mistakes

Mistake	Symptom	Fix
Intrinsic weight β too large	Agent ignores task reward	Anneal β
Curiosity on noise	Agent frozen on TV	RND norm, ensemble
No episodic reset bonus	Same room re-novelty	Episodic counts
Exploration in unsafe sim	Dangerous real deploy	Safe sim + constraints
Same ε train/test	Wrong eval	Greedy / mean at eval
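
The "anneal β" fix in the table can be as simple as a linear schedule. The sketch below assumes a linear decay and hypothetical default values; other schedules (exponential, step-wise) are equally plausible.

```python
def annealed_beta(step, beta_start=1.0, beta_end=0.0, anneal_steps=100_000):
    """Linearly decay the intrinsic-reward weight from beta_start to beta_end."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)

def shaped_reward(r_extrinsic, r_intrinsic, step):
    """Training-time reward; evaluation should report r_extrinsic alone."""
    return r_extrinsic + annealed_beta(step) * r_intrinsic

early = shaped_reward(0.0, 2.0, step=0)         # intrinsic term dominates
late = shaped_reward(1.0, 2.0, step=100_000)    # only task reward remains
```

Reporting extrinsic reward alone at evaluation time, with greedy or mean actions, keeps the comparison between runs independent of the exploration settings.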

Closing

Exploration is not one trick — match state representation and reward sparsity to the bonus. Intrinsic motivation bootstraps learning when extrinsic signal is rare; combine with offline data and safety filters before production-facing exploration.

What's next

Next lesson — Multi-agent RL basics