Neural networks as approximators
Before we begin
A neural network is a flexible function approximator: layers of nonlinear transforms map raw observations to values or policy logits. Deep Q-Networks use a network to output Q(s, ·) for all discrete actions at once; policy-gradient methods output action probabilities or continuous action means.
The update logic from linear function approximation (FA) carries over, but the added capacity, nonlinearity, and training dynamics make learning harder to stabilize.
What you will learn
- Map linear FA notation to Q(s, a; θ) with network weights θ.
- Design DQN-style architectures (shared trunk with per-action Q heads, or a single Q output with the action as input).
- Write the Q-fitting loss as supervised regression on TD targets.
- Choose activations, output layers, and input preprocessing for RL.
- Connect these pieces to PyTorch training-loop patterns.
From w to θ
Linear: Q̂(s,a) = w_aᵀ φ(s)
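Neural: Q̂(s,a) = f_θ(s, a), where f is an MLP or a CNN.
With a shared trunk, h = CNN(obs) is computed once and each action reads its value from its own head, Q_a = head_a(h). This is efficient for Atari, where a single image must be scored for many actions.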
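To make the mapping from w to θ concrete, the linear case can be read as a network with no hidden layers. The sketch below assumes a random stand-in for φ(s) and illustrative sizes; it only checks that stacking the per-action vectors w_a as rows of a bias-free linear layer reproduces w_aᵀ φ(s).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

feature_dim, n_actions = 5, 3        # illustrative sizes (assumptions)
phi_s = torch.randn(feature_dim)     # stand-in for the feature vector φ(s)

# Linear FA: one weight vector w_a per action, stacked as the rows of W.
W = torch.randn(n_actions, feature_dim)
q_linear = W @ phi_s                 # q_linear[a] = w_aᵀ φ(s)

# The same function written as a network: a single bias-free linear layer.
layer = nn.Linear(feature_dim, n_actions, bias=False)
with torch.no_grad():
    layer.weight.copy_(W)            # here θ is exactly the stacked w_a
    q_network = layer(phi_s)

print(torch.allclose(q_linear, q_network))
```

Adding hidden layers and nonlinearities in front of that last layer is what turns fixed features φ(s) into learned ones.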

    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, obs):
            return self.net(obs)  # shape (batch, n_actions)

TD target as regression label
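For a transition (s, a, r, s′):
y = r + γ max_a′ Q̂(s′, a′; θ⁻)
Loss on (s, a): L = (y − Q̂(s, a; θ))²
If s′ is terminal, the bootstrap term is dropped and y = r.
θ⁻ denotes the target network, a frozen copy of θ that DQN refreshes only periodically. Holding it fixed between refreshes reduces moving-target instability.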
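A small worked instance of the label and loss may help before reading the batched version. All numbers here are made up for illustration; nothing is learned.

```python
import torch

gamma = 0.99                                  # discount factor (illustrative)
reward = torch.tensor(1.0)
q_next_target = torch.tensor([2.0, 3.0])      # made-up Q̂(s′, ·; θ⁻)
q_sa = torch.tensor(3.5)                      # made-up Q̂(s, a; θ)

# Regression label: reward plus discounted best next value under θ⁻.
y = reward + gamma * q_next_target.max()      # 1 + 0.99 * 3 = 3.97
loss = (y - q_sa) ** 2                        # 0.47² ≈ 0.2209

# For a terminal s′ there is nothing to bootstrap from, so the label is r.
y_terminal = reward
```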
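
    def q_learning_loss(q_net, target_net, batch, gamma):
        # batch: obs (B, obs_dim), actions (B,) int64, rewards (B,) float,
        # next_obs (B, obs_dim), terminated (B,) bool
        obs, actions, rewards, next_obs, terminated = batch
        # Q̂(s, a; θ) for the actions actually taken
        q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():  # no gradient flows through the target
            q_next = target_net(next_obs).max(dim=1).values
            # Drop the bootstrap term for terminal transitions
            q_next = q_next * (~terminated).float()
            target = rewards + gamma * q_next
        return ((target - q_sa) ** 2).mean()

Architecture choices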
| Problem | Typical arch |
|---|---|
| Low-dim vector (CartPole) | 2–3 layer MLP |
| Images | CNN (Nature DQN stack) |
| Continuous actions | Separate actor (policy) net (Module 5+) |

| Output | Activation |
|---|---|
| Q-values | None (unbounded) |
| Policy probs | Softmax |
| Continuous mean | Tanh × scale |
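The output-layer choices in the table can be sketched as three heads on the same trunk features. The sizes, the random stand-in for the trunk output `h`, and the bound `max_action` are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden, n_actions, act_dim = 32, 4, 2   # illustrative sizes (assumptions)
max_action = 2.0                        # hypothetical action bound
h = torch.randn(8, hidden)              # stand-in for trunk features, batch of 8

q_head = nn.Linear(hidden, n_actions)       # Q-values: raw linear output
logit_head = nn.Linear(hidden, n_actions)   # policy: logits, then softmax
mean_head = nn.Linear(hidden, act_dim)      # continuous mean: tanh, then scale

q_values = q_head(h)                                  # unbounded real values
probs = torch.softmax(logit_head(h), dim=-1)          # each row sums to 1
action_mean = max_action * torch.tanh(mean_head(h))   # within ±max_action
```

Leaving the Q head linear matters because returns can take any sign and magnitude, whereas a squashing activation would cap what the network can represent.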
Preprocessing
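- Normalize observations with a running mean and standard deviation.
- Stack consecutive frames so the network can infer velocity (Atari).
- Clipping rewards to -1, 0, or +1 sometimes stabilizes training, but it changes the problem definition because reward magnitudes are lost.
- Keep tensors in float32 on the GPU for throughput.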
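One way to implement the first and third items is sketched below. `RunningNorm` is a hypothetical helper, not a library class; it merges batch statistics with the standard parallel mean/variance update, and the data is synthetic.

```python
import torch

class RunningNorm:
    """Tracks a running mean and variance of observations (hypothetical helper)."""

    def __init__(self, obs_dim, eps=1e-8):
        self.mean = torch.zeros(obs_dim)
        self.var = torch.ones(obs_dim)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        # Merge the batch statistics into the running ones.
        b_mean = batch.mean(dim=0)
        b_var = batch.var(dim=0, unbiased=False)
        b_count = batch.shape[0]
        total = self.count + b_count
        delta = b_mean - self.mean
        new_mean = self.mean + delta * b_count / total
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def __call__(self, obs):
        return (obs - self.mean) / torch.sqrt(self.var + self.eps)

torch.manual_seed(0)
data = torch.randn(1000, 4) * 5.0 + 3.0    # synthetic observations
norm = RunningNorm(obs_dim=4)
for chunk in data.split(100):              # feed the data in chunks of 100
    norm.update(chunk)
normalized = norm(data)                    # roughly zero mean, unit variance

# Reward clipping to -1, 0, +1 via the sign function.
clipped_reward = torch.sign(torch.tensor([-7.0, 0.0, 12.5]))
```

In a real agent the statistics would be updated as observations arrive, and they would need to be saved alongside the network weights so evaluation uses the same scaling.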
Training loop sketch
- Collect transition with ε-greedy Q-net.
- Store in replay buffer.
- Sample mini-batch.
- Compute the loss, then call optimizer.zero_grad(), loss.backward(), and optimizer.step().
- Periodically θ⁻ ← θ.
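The steps above can be put together as a minimal sketch. `ToyEnv` is a hypothetical stand-in environment written for this example, and every hyperparameter is arbitrary; this shows the shape of the loop, not a tuned DQN.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

class ToyEnv:
    """Hypothetical 1-D task: action 1 moves right, action 0 moves left.
    Reaching the right edge gives reward 1 and terminates the episode."""

    def reset(self):
        self.pos, self.t = 0, 0
        return torch.tensor([self.pos / 5.0])

    def step(self, action):
        self.pos = max(-5, min(5, self.pos + (1 if action == 1 else -1)))
        self.t += 1
        terminated = self.pos >= 5
        truncated = self.t >= 50          # time limit, not a true terminal state
        reward = 1.0 if terminated else 0.0
        return torch.tensor([self.pos / 5.0]), reward, terminated, truncated

torch.manual_seed(0)
random.seed(0)
q_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(q_net)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=5000)
gamma, epsilon, batch_size, sync_every = 0.99, 0.1, 32, 100

env = ToyEnv()
obs = env.reset()
for step in range(1, 501):
    # 1. Collect a transition with the ε-greedy Q-net.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        with torch.no_grad():
            action = int(q_net(obs).argmax())
    next_obs, reward, terminated, truncated = env.step(action)

    # 2. Store it in the replay buffer.
    buffer.append((obs, action, reward, next_obs, terminated))
    obs = env.reset() if (terminated or truncated) else next_obs
    if len(buffer) < batch_size:
        continue

    # 3. Sample a mini-batch.
    o, a, r, o2, done = zip(*random.sample(buffer, batch_size))
    o, o2 = torch.stack(o), torch.stack(o2)
    a, r, done = torch.tensor(a), torch.tensor(r), torch.tensor(done)

    # 4. Compute the loss and take an optimizer step.
    q_sa = q_net(o).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(o2).max(dim=1).values * (~done).float()
        target = r + gamma * q_next
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 5. Periodically copy θ into θ⁻.
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```

Only `terminated` masks the bootstrap term; a time-limit truncation resets the episode but still bootstraps from the next observation.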
Checkpoint: Why not use θ as both predictor and target every step?
Answer
The target would move with every update: the network chases its own shifting predictions, which can cause oscillation or divergence. A target network slows how fast the target moves (Module 4).
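The frozen-copy idea can be checked directly. In this sketch (tiny linear Q-net, random inputs, and a made-up regression step, all assumptions), one gradient step changes θ while the outputs computed from θ⁻ stay exactly the same until the explicit sync.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
q_net = nn.Linear(4, 2)
target_net = copy.deepcopy(q_net)        # θ⁻ starts as a copy of θ
for p in target_net.parameters():
    p.requires_grad_(False)              # the target is never trained directly

optimizer = torch.optim.SGD(q_net.parameters(), lr=0.1)
obs = torch.randn(16, 4)                 # random stand-in observations

before = target_net(obs).clone()
# Illustrative regression step: pull Q(s, a=0; θ) toward max_a Q(s, a; θ⁻).
loss = ((target_net(obs).max(dim=1).values - q_net(obs)[:, 0]) ** 2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

target_unchanged = torch.equal(before, target_net(obs))          # θ⁻ did not move
online_moved = not torch.equal(q_net.weight, target_net.weight)  # θ did

target_net.load_state_dict(q_net.state_dict())                   # sync θ⁻ ← θ
```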
Capacity trade-offs
| Too small | Too large |
|---|---|
| Underfitting, poor Q estimates | Overfitting to recent data, slow training |
| Fast to train | Needs more data and regularization |
Start small on CartPole; scale up to a CNN for pixels.
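To get a feel for how quickly capacity grows with width, the sketch below counts parameters for the two-hidden-layer MLP shape used earlier, with CartPole-like sizes (4 inputs, 2 actions). The helper names `make_mlp` and `count_params` are hypothetical.

```python
import torch.nn as nn

def make_mlp(obs_dim, n_actions, hidden):
    # Same layout as a two-hidden-layer Q-network.
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )

def count_params(model):
    return sum(p.numel() for p in model.parameters())

small = count_params(make_mlp(4, 2, hidden=16))      # 386
medium = count_params(make_mlp(4, 2, hidden=128))    # 17,410
large = count_params(make_mlp(4, 2, hidden=1024))    # 1,056,770
```

The hidden-to-hidden layer dominates, so the count grows roughly with the square of the width.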
Common mistakes
- Wrong gather index shape in the PyTorch Q loss.
- Forgetting optimizer.zero_grad(), so gradients accumulate across steps.
- Training on correlated consecutive frames without replay.
- Applying softmax to Q-values (the ranking survives, but the values no longer estimate returns, so they cannot be regressed onto TD targets).
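Two of these mistakes are easy to reproduce on a hand-made batch. The numbers below are made up; the point is the shapes, and what softmax does to the values.

```python
import torch

q = torch.tensor([[1.0, 5.0, 2.0],
                  [0.5, 0.1, 4.0]])     # Q-values, shape (batch=2, n_actions=3)
actions = torch.tensor([1, 2])          # int64 action indices, shape (2,)
target = torch.tensor([4.0, 3.0])       # TD targets, shape (2,)

# Correct: index of shape (batch, 1) for gather, then squeeze back to (batch,).
q_sa = q.gather(1, actions.unsqueeze(1)).squeeze(1)     # tensor([5., 4.])

# Bug: without squeeze the result is (2, 1), and the subtraction silently
# broadcasts (2,) against (2, 1) into a (2, 2) matrix of mismatched pairs.
q_sa_unsqueezed = q.gather(1, actions.unsqueeze(1))
bad_diff = target - q_sa_unsqueezed

# Softmax keeps the argmax but squashes the values into probabilities.
soft = torch.softmax(q, dim=1)
same_argmax = torch.equal(soft.argmax(dim=1), q.argmax(dim=1))
```

Asserting `q_sa.shape == target.shape` before computing the loss catches the broadcasting bug early.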
Neural approximators power modern deep RL. Combined with bootstrapping and off-policy learning, they also form the deadly triad, which is why training can diverge without replay, target nets, and careful tuning.