Neural networks as approximators

Before we begin

A neural network is a flexible function approximator: layers of nonlinear transforms map raw observations to values or policy logits. Deep Q-Networks (DQN) use a network to output Q(s, ·) for every discrete action in one forward pass; policy-gradient methods output action probabilities or, for continuous actions, the mean of an action distribution.

The update logic from linear FA carries over, but the added capacity and nonlinearity change the training dynamics: convergence can no longer be taken for granted.

What you will learn

Map linear FA notation to Q(s, a; θ) with network weights θ.
Architect DQN-style heads (shared trunk with per-action Q outputs, or a single Q output with the action as an input).
Write the loss for Q-fitting as supervised regression on TD targets.
Choose activations, output layers, and input preprocessing for RL.
Connect these pieces to PyTorch training-loop patterns.

From w to θ

Linear: Q̂(s,a) = w_aᵀ φ(s)
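Neural: Q̂(s,a) = f_θ(s, a), where f_θ is an MLP or CNN with weights θ.

Shared trunk: h = CNN(obs), then Q_a = head_a(h). This is efficient for Atari (many actions, one image) because a single pass over the image yields a value for every action.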

python

import torch
import torch.nn as nn
 
class QNet(nn.Module):
    """MLP that maps an observation vector to one Q-value per discrete action."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
 
    def forward(self, obs):
        return self.net(obs)  # shape (batch, n_actions)
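
The other head layout, a single Q output with the action fed in as an input, could look roughly like the sketch below. The class name `QNetActionInput` and the one-hot action encoding are illustrative assumptions, not something this lesson prescribes. Note the cost: scoring every action now takes one forward pass per action instead of one pass in total.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetActionInput(nn.Module):
    """Hypothetical variant: scores one (state, action) pair per row."""

    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # a single scalar Q(s, a)
        )

    def forward(self, obs, actions):
        # actions: int64 tensor of shape (batch,); one-hot encode, then concatenate
        a_onehot = F.one_hot(actions, num_classes=self.n_actions).float()
        x = torch.cat([obs, a_onehot], dim=1)
        return self.net(x).squeeze(1)  # shape (batch,)


q = QNetActionInput(obs_dim=4, n_actions=2)
obs = torch.randn(3, 4)
actions = torch.tensor([0, 1, 1])
q_sa = q(obs, actions)
```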

TD target as regression label

For transition (s, a, r, s′):

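y = r + γ max_a′ Q̂(s′, a′; θ⁻)   (if s′ is terminal, y = r, because there is nothing to bootstrap from)

Loss on (s, a): L = (y − Q̂(s, a; θ))², with y treated as a constant label (no gradient flows through the target).

θ⁻ denotes the weights of the target network, a frozen copy of θ used in DQN to reduce moving-target instability.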

python

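def q_learning_loss(q_net, target_net, batch, gamma):
    """Mean squared TD error for a batch of transitions (DQN-style)."""
    obs, actions, rewards, next_obs, terminated = batch
    # Pick Q(s, a) for the action actually taken; actions is int64 with shape (batch,).
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target is a fixed label, so no gradient flows through it
        q_next = target_net(next_obs).max(dim=1).values
        # Zero the bootstrap term after terminal transitions (works for bool or 0/1 flags).
        q_next = q_next * (1.0 - terminated.float())
        target = rewards + gamma * q_next
    return ((target - q_sa) ** 2).mean()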
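
To make the arithmetic concrete, here is a small worked example that replaces the two networks with hand-written tensors. All the numbers are made up for illustration; the second transition is terminal, so its target collapses to the reward.

```python
import torch

gamma = 0.99
rewards = torch.tensor([1.0, 1.0])
terminated = torch.tensor([False, True])

# Made-up target-network outputs Q(s', . ; θ⁻): two next states, two actions each
q_next_all = torch.tensor([[0.5, 2.0],
                           [3.0, 1.0]])
# Made-up online-network values Q(s, a; θ) for the actions that were taken
q_sa = torch.tensor([2.5, 0.5])

q_next = q_next_all.max(dim=1).values          # [2.0, 3.0]
q_next = q_next * (1.0 - terminated.float())   # [2.0, 0.0]: no bootstrap after a terminal step
target = rewards + gamma * q_next              # [2.98, 1.0]
loss = ((target - q_sa) ** 2).mean()           # (0.48**2 + 0.5**2) / 2 = 0.2402
```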

Architecture choices

Problem	Typical architecture
Low-dim vector (CartPole)	2–3 layer MLP
Images	CNN (Nature DQN stack)
Continuous actions	Separate actor (policy) net — Module 5+

Output	Activation
Q-values	None (unbounded)
Policy probs	Softmax
Continuous mean	Tanh × scale
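
The three output conventions in the table can be sketched as follows. The feature tensor `h` stands in for the output of any trunk, and `max_action` is an assumed action bound chosen only for illustration.

```python
import torch
import torch.nn as nn

hidden, n_actions, act_dim = 64, 3, 2
max_action = 2.0  # assumed bound on continuous actions (illustrative)

h = torch.randn(5, hidden)  # stand-in for trunk features, batch of 5

q_head = nn.Linear(hidden, n_actions)       # Q-values: raw linear output, no activation
policy_head = nn.Linear(hidden, n_actions)  # logits, converted to probabilities by softmax
mean_head = nn.Linear(hidden, act_dim)      # continuous mean, squashed by tanh and rescaled

q_values = q_head(h)                              # unbounded real numbers
probs = torch.softmax(policy_head(h), dim=1)      # each row sums to 1
mean = max_action * torch.tanh(mean_head(h))      # each entry lies in [-max_action, max_action]
```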

Preprocessing

Normalize observations (running mean/std).
Stack consecutive frames so the network can infer velocity (Atari).
Clip rewards to {-1, 0, +1}: this sometimes stabilizes training, but it changes the problem definition.
Use float32 tensors on the GPU for throughput.
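
As one way to implement the running mean/std normalization from the list above, the sketch below merges batch statistics into running ones with the standard parallel-variance update. The class name `RunningNorm` and its interface are hypothetical, not part of any library.

```python
import torch


class RunningNorm:
    """Hypothetical helper: tracks per-feature mean/variance and standardizes observations."""

    def __init__(self, obs_dim, eps=1e-8):
        self.mean = torch.zeros(obs_dim)
        self.var = torch.ones(obs_dim)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        # batch: float tensor of shape (n, obs_dim)
        n = batch.shape[0]
        b_mean = batch.mean(dim=0)
        b_var = batch.var(dim=0, unbiased=False)
        total = self.count + n
        delta = b_mean - self.mean
        # Combine the old and new sums of squared deviations
        m2 = self.var * self.count + b_var * n + delta ** 2 * self.count * n / total
        self.mean = self.mean + delta * n / total
        self.var = m2 / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / torch.sqrt(self.var + self.eps)


torch.manual_seed(0)
data = torch.randn(100, 3) * 5 + 2   # synthetic observations, for illustration only
norm = RunningNorm(obs_dim=3)
norm.update(data[:40])               # statistics arrive in chunks, as they would online
norm.update(data[40:])
normalized = norm(data)              # roughly zero mean, unit variance per feature
```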

Training loop sketch

Collect a transition with the ε-greedy Q-net.
Store it in the replay buffer.
Sample a mini-batch.
Compute the loss, then call optimizer.zero_grad(), loss.backward(), and optimizer.step().
Periodically copy θ⁻ ← θ.
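
The five steps above can be sketched end to end as follows. This is a minimal illustration, not a tuned DQN: `fake_env_step` is a placeholder with random dynamics rather than a real environment, and every hyperparameter value is an arbitrary assumption.

```python
import random
from collections import deque

import torch
import torch.nn as nn

torch.manual_seed(0)
random.seed(0)

obs_dim, n_actions, gamma = 4, 2, 0.99
epsilon, batch_size, sync_every = 0.1, 32, 50  # illustrative values


def make_net():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))


q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())  # start with θ⁻ = θ
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)


def fake_env_step(obs, action):
    """Placeholder dynamics (NOT a real environment): random next state, toy reward."""
    next_obs = torch.randn(obs_dim)
    reward = float(action == 0)
    terminated = random.random() < 0.05
    return next_obs, reward, terminated


obs = torch.randn(obs_dim)
for step in range(1, 201):
    # 1. ε-greedy action from the online network
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        with torch.no_grad():
            action = int(q_net(obs.unsqueeze(0)).argmax(dim=1).item())
    next_obs, reward, terminated = fake_env_step(obs, action)

    # 2. store the transition in the replay buffer
    buffer.append((obs, action, reward, next_obs, terminated))
    obs = torch.randn(obs_dim) if terminated else next_obs
    if len(buffer) < batch_size:
        continue

    # 3. sample a mini-batch uniformly at random
    o, a, r, o2, done = zip(*random.sample(buffer, batch_size))
    o, o2 = torch.stack(o), torch.stack(o2)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    # 4. loss, backward pass, optimizer step
    q_sa = q_net(o).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(o2).max(dim=1).values
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 5. periodic hard update of the target network
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```

Because the placeholder environment is random, the loss value itself is not meaningful here; the point is the order of operations.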

Checkpoint: Why not use θ as both predictor and target every step?

Answer

Moving target: the network chases its own shifting predictions, which can lead to oscillation or divergence. A target net slows the movement of the targets (Module 4).

Capacity trade-offs

Too small	Too large
Underfitting, poor Q estimates	Overfitting to recent data, slow training
Fast	Needs more data, regularization

Start small on CartPole; scale CNN for pixels.
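
To get a feel for how quickly capacity grows with width, the sketch below counts the parameters of a two-hidden-layer MLP for CartPole-sized inputs (4 observation features, 2 actions). The helper name `mlp_param_count` is hypothetical.

```python
import torch.nn as nn


def mlp_param_count(obs_dim, n_actions, hidden):
    """Count weights and biases of a two-hidden-layer MLP Q-network."""
    net = nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )
    return sum(p.numel() for p in net.parameters())


small = mlp_param_count(4, 2, hidden=32)    # 1,282 parameters
medium = mlp_param_count(4, 2, hidden=128)  # 17,410 parameters
```

The hidden-to-hidden layer dominates, so the count grows roughly quadratically with width.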

Common mistakes

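Wrong gather index shape in the PyTorch Q loss (the index must be (batch, 1), not (batch,)).
Forgetting optimizer.zero_grad(), so gradients accumulate across steps.
Training on correlated consecutive frames without replay.
Applying softmax to Q-values (the ranking survives, but the scale is destroyed, so the outputs can no longer match TD targets).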
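
Two of these mistakes are easy to reproduce on a tiny tensor. The sketch below uses made-up Q-values to show the index shape that `gather` expects and what softmax does to the values.

```python
import torch

q = torch.tensor([[1.0, 5.0, 3.0],
                  [10.0, -2.0, 4.0]])   # made-up Q-values: 2 states, 3 actions
actions = torch.tensor([1, 2])          # shape (batch,), dtype int64

# gather needs an index with as many dimensions as q: (batch, 1), not (batch,)
q_sa = q.gather(1, actions.unsqueeze(1)).squeeze(1)   # picks q[0, 1] and q[1, 2]

try:
    q.gather(1, actions)                # 1-D index against a 2-D input
    shape_error = False
except RuntimeError:
    shape_error = True

# softmax squashes every row into (0, 1) with sum 1, so the values lose their scale
squashed = torch.softmax(q, dim=1)
same_greedy_action = torch.equal(squashed.argmax(dim=1), q.argmax(dim=1))
```

The greedy action is unchanged after softmax, but values confined to (0, 1) cannot regress onto targets of the form r + γ max Q.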

Neural approximators power modern deep RL. Combined with bootstrapping and off-policy learning, they also form the deadly triad, which is why training can diverge without replay, target nets, and careful tuning.

What's next

Next lesson — Instability & the deadly triad