
Module 3 — Function approximation

Neural networks as approximators

Nonlinear FA, shared representations, and batching transitions.

~60 min read + exercises

Before we begin

A neural network is a flexible function approximator: layers of nonlinear transforms map raw observations to values or policy logits. Deep Q-Networks (DQN) use a network to output Q(s, ·) for all discrete actions at once; policy-gradient methods use one to output action probabilities or, for continuous actions, the mean of the action distribution.

The update logic from linear FA carries over, but the prediction is no longer linear in the parameters. Capacity, nonlinearity, and training dynamics all need more care than in the linear case.


What you will learn

  • Map linear FA notation to Q(s, a; θ) with network weights θ.
  • Architect DQN-style heads (shared trunk with per-action Q outputs, or a single Q output with the action as input).
  • Write the Q-fitting loss as supervised regression on TD targets.
  • Choose activations, output layers, and input preprocessing for RL.
  • Connect these pieces to a standard PyTorch training loop.

From w to θ

Linear: Q̂(s,a) = w_aᵀ φ(s)
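Neural: Q̂(s,a) = f_θ(s, a), where f_θ is an MLP or CNN with weights θ.

With a shared trunk, h = CNN(obs) is computed once and each action gets its own output, Q_a = head_a(h). One forward pass then yields every action value, which is efficient for Atari (many actions, one image).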

python
import torch
import torch.nn as nn
 
class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
 
    def forward(self, obs):
        return self.net(obs)  # shape (batch, n_actions)
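
The class above emits one Q-value per discrete action. The other option named in the learning goals, a single Q output with the action as an input, might look like the sketch below. The class name QNetActionInput and the one-hot action encoding are illustrative assumptions, not part of the lesson's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetActionInput(nn.Module):
    """Hypothetical variant: scores one (obs, action) pair per batch row."""

    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, actions):
        # actions: int64 tensor of shape (batch,)
        a_onehot = F.one_hot(actions, num_classes=self.n_actions).float()
        x = torch.cat([obs, a_onehot], dim=1)
        return self.net(x).squeeze(1)  # shape (batch,)


net = QNetActionInput(obs_dim=4, n_actions=2)
obs = torch.randn(3, 4)
actions = torch.tensor([0, 1, 1])
q_sa = net(obs, actions)  # one scalar Q-value per row
```

With this layout, picking the greedy action means evaluating the network once per candidate action, which is one reason the per-action output layer is the usual choice for small discrete action sets.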

TD target as regression label

For transition (s, a, r, s′):

y = r + γ max_a′ Q̂(s′, a′; θ⁻)

Loss on (s, a): L = (y − Q̂(s, a; θ))²

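θ⁻ denotes the target network, a frozen copy of θ that DQN refreshes only periodically; holding it fixed reduces moving-target instability. When s′ is terminal there is nothing to bootstrap from, so the target reduces to y = r.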

python
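def q_learning_loss(q_net, target_net, batch, gamma):
    # Expected batch tensors: obs (B, obs_dim), actions (B,) int64,
    # rewards (B,) float, next_obs (B, obs_dim), terminated (B,) bool.
    obs, actions, rewards, next_obs, terminated = batch
    # Select Q(s, a) for the action actually taken in each row.
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the TD target is a fixed label; no gradient flows through it
        q_next = target_net(next_obs).max(dim=1).values
        # Zero the bootstrap term on terminal transitions, so y = r there.
        q_next = q_next * (1.0 - terminated.float())
        target = rewards + gamma * q_next
    return ((target - q_sa) ** 2).mean()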
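
To make the target concrete, here is the same arithmetic on a single transition in plain Python. All numbers (γ = 0.9, the Q-values, the reward) are invented for illustration.

```python
gamma = 0.9

# Non-terminal transition: r = 1.0, target-network values at s' are [2.0, 3.0].
r, q_next_all, terminated = 1.0, [2.0, 3.0], False
mask = 0.0 if terminated else 1.0
y = r + gamma * max(q_next_all) * mask   # 1.0 + 0.9 * 3.0, about 3.7

q_sa = 3.0                  # current prediction Q(s, a; theta)
loss = (y - q_sa) ** 2      # (3.7 - 3.0)^2, about 0.49

# Terminal transition: the bootstrap term is masked out, so y is just the reward.
y_terminal = 1.0 + gamma * max(q_next_all) * 0.0
```

Only q_sa depends on θ here; y is treated as a constant label, which is what makes the update look like ordinary regression.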

Architecture choices

Problem | Typical arch
Low-dim vector (CartPole) | 2–3 layer MLP
Images | CNN (Nature DQN stack)
Continuous actions | Separate actor (policy) net (Module 5+)

Output | Activation
Q-values | None (unbounded)
Policy probs | Softmax
Continuous mean | Tanh × scale
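
The three output conventions can be sketched side by side. The head names, sizes, and the symmetric action bound max_action are assumptions made for this example; the random tensor h stands in for features from a shared trunk.

```python
import torch
import torch.nn as nn

hidden, n_actions, act_dim = 64, 3, 2
max_action = 2.0  # assumed symmetric action bound [-2, 2]

h = torch.randn(5, hidden)  # stand-in for trunk features, batch of 5

q_head = nn.Linear(hidden, n_actions)
policy_head = nn.Linear(hidden, n_actions)
mean_head = nn.Linear(hidden, act_dim)

q_values = q_head(h)                            # no activation: returns are unbounded
probs = torch.softmax(policy_head(h), dim=1)    # each row sums to 1
mean = max_action * torch.tanh(mean_head(h))    # squashed into [-max_action, max_action]
```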

Preprocessing

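  • Normalize observations (running mean/std).
  • Stack recent frames so that velocity is observable from the input (Atari).
  • Clipping rewards to (-1, 0, +1) sometimes stabilizes training, but it changes the problem definition: the agent no longer distinguishes large rewards from small ones.
  • Use float32 tensors on the GPU for throughput.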
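
A minimal sketch of the first and third items follows. RunningNorm is a hypothetical helper written for this example (it merges batch statistics with the standard parallel mean/variance update), and torch.sign is used as one simple way to get rewards in (-1, 0, +1).

```python
import torch


class RunningNorm:
    """Hypothetical helper: tracks per-feature mean and variance across batches."""

    def __init__(self, obs_dim, eps=1e-8):
        self.mean = torch.zeros(obs_dim)
        self.var = torch.ones(obs_dim)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        # Merge this batch's statistics into the running ones.
        b_mean = batch.mean(dim=0)
        b_var = batch.var(dim=0, unbiased=False)
        b_count = batch.shape[0]
        total = self.count + b_count
        delta = b_mean - self.mean
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.mean = self.mean + delta * b_count / total
        self.var = m2 / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / torch.sqrt(self.var + self.eps)


norm = RunningNorm(obs_dim=2)
data = torch.tensor([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]])
norm.update(data[:2])
norm.update(data[2:])
normalized = norm(data)  # float32, roughly zero mean and unit variance per feature

clipped = torch.sign(torch.tensor([-7.0, 0.0, 250.0]))  # sign-based reward clipping
```

After both updates the running statistics match those of the full data set, so the normalizer can be fed transitions as they arrive rather than needing all observations up front.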

Training loop sketch

  1. Collect a transition by acting ε-greedily with respect to the Q-net.
  2. Store it in the replay buffer.
  3. Sample a mini-batch.
  4. Compute the loss, then call optimizer.zero_grad(), loss.backward(), and optimizer.step().
  5. Periodically copy the weights: θ⁻ ← θ.
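
The five steps above can be sketched end to end on a toy problem. Everything specific here is an assumption made so the example runs without extra dependencies: ToyEnv is a made-up one-dimensional task, and the hyperparameters are arbitrary rather than tuned.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn


class ToyEnv:
    """Hypothetical 1-D task: action 1 moves right, action 0 moves left.
    Reaching x >= 1 gives reward 1 and terminates the episode."""

    def reset(self):
        self.x, self.t = 0.0, 0
        return [self.x]

    def step(self, action):
        self.x = max(self.x + (0.25 if action == 1 else -0.25), -1.0)
        self.t += 1
        terminated = self.x >= 1.0
        truncated = self.t >= 20  # time limit, not a true terminal state
        return [self.x], float(terminated), terminated, truncated


torch.manual_seed(0)
random.seed(0)

q_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(q_net)  # theta_minus starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=5000)
gamma, epsilon, batch_size, sync_every = 0.99, 0.2, 32, 100

env = ToyEnv()
obs = env.reset()
for step in range(1, 501):
    # 1. Collect a transition with epsilon-greedy actions.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        with torch.no_grad():
            action = q_net(torch.tensor([obs])).argmax(dim=1).item()
    next_obs, reward, terminated, truncated = env.step(action)

    # 2. Store it in the replay buffer (only true termination masks the bootstrap).
    buffer.append((obs, action, reward, next_obs, terminated))
    obs = env.reset() if (terminated or truncated) else next_obs

    if len(buffer) < batch_size:
        continue

    # 3. Sample a mini-batch.
    o, a, r, o2, d = zip(*random.sample(list(buffer), batch_size))
    o, o2 = torch.tensor(o), torch.tensor(o2)
    a, r = torch.tensor(a), torch.tensor(r)
    d = torch.tensor(d, dtype=torch.float32)

    # 4. Compute the loss and take an optimizer step.
    q_sa = q_net(o).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - d) * target_net(o2).max(dim=1).values
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 5. Periodically copy theta into theta_minus.
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```

A real agent would typically decay ε over time and train for far longer; the point of the sketch is only the order of operations.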

Checkpoint: Why not use θ as both predictor and target every step?

Answer

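Because the target would move with every update: the network chases its own shifting predictions, which can cause oscillation or divergence. A separate target network, refreshed only periodically, slows the target's movement (Module 4).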


Capacity trade-offs

Too small | Too large
Underfitting, poor Q | Overfitting recent data, slow training
Fast | Needs more data, regularization

Start small on CartPole; scale CNN for pixels.


Common mistakes

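  • Wrong gather index shape in the PyTorch Q loss (the index must be (batch, 1), and the result squeezed back to (batch,)).
  • Forgetting optimizer.zero_grad(), so gradients accumulate across steps.
  • Training on correlated consecutive frames without replay.
  • Applying softmax to Q-values: the outputs are forced into [0, 1] and no longer represent returns, so regression onto TD targets breaks.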
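
The gather shape issue is easy to reproduce on a tiny tensor. The values below are made up; the point is only the shapes.

```python
import torch

q = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # shape (batch=2, n_actions=3)
actions = torch.tensor([2, 0])         # shape (2,), dtype int64

# gather needs an index with the same number of dims as q, so add a column axis.
# Passing `actions` of shape (2,) directly raises a RuntimeError.
q_sa = q.gather(1, actions.unsqueeze(1)).squeeze(1)     # shape (2,)

# Pitfall: skipping the final squeeze leaves shape (2, 1), which broadcasts
# against a (2,) target into a (2, 2) matrix without any error.
target = torch.tensor([1.0, 1.0])
bad_diff = target - q.gather(1, actions.unsqueeze(1))   # shape (2, 2), silently wrong
good_diff = target - q_sa                               # shape (2,)
```

Checking that the prediction and target have identical shapes before computing the loss catches the silent broadcast.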

Neural approximators power modern deep RL. Combined with bootstrapping and off-policy learning, function approximation also forms the deadly triad, which is why training can diverge without replay, target nets, and careful tuning.

