Double, dueling & prioritized replay
Before we begin
Vanilla DQN tends to overestimate action values: taking a max over noisy estimates favors whichever action happens to be overestimated. Double DQN decouples action selection from evaluation. Dueling architectures separate the state value V(s) from the advantages A(s,a). Prioritized replay samples high-error transitions more often. Together they are the modern DQN baseline stack.
Learning objectives
- Write the Double DQN target using online net for argmax, target net for value.
- Draw the dueling head: Q(s,a) = V(s) + A(s,a) − mean_a A(s,a).
- Explain TD-error magnitude as a priority signal.
- Compare uniform vs prioritized replay trade-offs.
- Recognize these as patches on the same DQN loop, not new paradigms.
Double DQN target
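Standard DQN uses the target parameters θ⁻ both to pick a′ and to evaluate it, so the noise that makes an action look best also inflates its value. Double DQN keeps θ⁻ for evaluation but lets the online parameters θ choose the action.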
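For comparison, the standard target can be written so that the shared use of θ⁻ is explicit. This is a sketch in the same notation as the rest of the section; the superscript label "DQN" is added here only to name the target, and terminal transitions (where the bootstrap term is dropped) are left out for brevity.

```latex
\begin{aligned}
y^{\text{DQN}} &= r + \gamma \max_{a'} Q(s', a'; \theta^{-}) \\
               &= r + \gamma \, Q\bigl(s',\ \arg\max_{a'} Q(s', a'; \theta^{-});\ \theta^{-}\bigr)
\end{aligned}
```

The second line shows that selection (the inner argmax) and evaluation (the outer Q) read from the same network. The Double DQN target changes only the parameters used in the inner argmax.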
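
    a* = argmax_a' Q(s', a'; θ)   # online network selects
    y  = r + γ Q(s', a*; θ⁻)      # target network evaluates

In PyTorch, with s2 the batch of next states, r the rewards, and d the done flags:

    with torch.no_grad():
        # Online network picks the greedy next action
        online_next = policy_net(s2)
        best_actions = online_next.argmax(dim=1)
        # Target network supplies the value of that action
        target_next = target_net(s2)
        next_q = target_next.gather(1, best_actions.unsqueeze(1)).squeeze(1)
        # (1.0 - d) removes the bootstrap term on terminal transitions
        target = r + gamma * (1.0 - d) * next_q

Dueling architecture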
Two streams share a trunk, then combine:

    import torch.nn as nn

    class DuelingQNet(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=128):
            super().__init__()
            # Shared feature trunk
            self.trunk = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.value = nn.Linear(hidden, 1)              # V(s)
            self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

        def forward(self, x):
            h = self.trunk(x)
            v = self.value(h)        # shape (batch, 1)
            adv = self.advantage(h)  # shape (batch, n_actions)
            # Subtract the mean advantage so V and A are identifiable
            return v + adv - adv.mean(dim=1, keepdim=True)

Intuition: in many states the value is nearly the same whatever the action (for example, a pole that is almost balanced). V(s) captures that shared value; A(s,a) captures the action-specific edge.
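The role of the mean subtraction can be checked with plain numbers. The sketch below uses ordinary Python lists rather than tensors; the helper name `dueling_q` and the example values are illustrative assumptions, not part of the network above.

```python
def dueling_q(v, adv):
    """Combine a scalar state value and per-action advantages into Q-values."""
    mean_adv = sum(adv) / len(adv)
    return [v + a - mean_adv for a in adv]


# Two (V, A) pairs that differ only by moving a constant 3.0 from V into A.
v1, adv1 = 5.0, [1.0, -1.0, 0.0]
v2, adv2 = 5.0 - 3.0, [a + 3.0 for a in adv1]

# Without centering, both pairs give identical Q-values,
# so the network cannot tell which split into V and A is "right".
naive1 = [v1 + a for a in adv1]
naive2 = [v2 + a for a in adv2]
print(naive1 == naive2)

# With centering, the shift in A is cancelled and the mean of Q equals V,
# which pins down a unique decomposition.
q1 = dueling_q(v1, adv1)
q2 = dueling_q(v2, adv2)
print(sum(q1) / len(q1), sum(q2) / len(q2))
```

With centering, the average of the Q-values over actions always equals the value stream's output, so the value head has a well-defined target.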
Worked example — overestimation
True Q*(s,a) ≈ 10 for both actions. Noisy estimates:
| Action | Q online | Q target |
|---|---|---|
| 0 | 10.2 | 9.8 |
| 1 | 10.5 | 9.0 |
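Taking the max over a single noisy column selects and evaluates with the same errors: on the online column it returns 10.5 (action 1), about 0.5 above the true value of 10. Double DQN lets the online network pick action 1 but reads its value from the target column, 9.0. In this particular draw that lands below the true value, but the evaluating error is not the one that won the argmax, so the estimate is not systematically inflated; averaged over many draws it is unbiased here, since both actions have the same true value and the errors are assumed independent and zero-mean. The 1.5 gap between 10.5 and 9.0 is the difference between the two estimates, not the size of the overestimation. (Vanilla DQN takes its max over the target column, 9.8 in this draw; its upward bias appears on average, not in every sample.)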
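A single draw cannot show a bias, so it helps to repeat the experiment many times. The simulation below is a minimal sketch under stated assumptions: both actions have true value 10, each estimate carries independent Gaussian noise, and the noise level `noise_std=0.5`, the trial count, and the function name `simulate` are all chosen here for illustration.

```python
import random


def simulate(n_actions=2, true_q=10.0, noise_std=0.5, trials=20000, seed=0):
    """Average target value under single-estimator max vs. double estimation."""
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same true action values.
        est_a = [true_q + rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        est_b = [true_q + rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        # Single estimator: select and evaluate with the same estimates.
        single_total += max(est_a)
        # Double estimator: select with A, evaluate with B.
        best = max(range(n_actions), key=lambda a: est_a[a])
        double_total += est_b[best]
    return single_total / trials, double_total / trials


single_mean, double_mean = simulate()
print(f"single estimator mean: {single_mean:.3f}")
print(f"double estimator mean: {double_mean:.3f}")
```

For two actions with independent zero-mean Gaussian noise of standard deviation σ, the expected max is μ + σ/√π, so the single-estimator average should land near 10.28 while the double-estimator average stays close to 10. In a real agent the online and target networks are correlated, so the decoupling is only partial.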
Prioritized experience replay (PER)
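Priority ∝ |TD error|^α, with α ≈ 0.6 a common choice (α = 0 recovers uniform sampling). Because non-uniform sampling biases the gradient, each sampled transition is scaled by an importance-sampling (IS) weight whose exponent β is annealed toward 1.0 over training.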
| Component | Role |
|---|---|
| Priority p_i | Larger TD error → sampled more |
| Sum tree | O(log n) sampling by priority |
| IS weight w_i | Unbiases gradient from non-uniform sampling |
Conceptual sampling loop:
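
    # priorities[i] = (abs(td_error) + eps) ** alpha
    # prob_i = priorities[i] / sum(priorities)
    # sample indices with prob_i; weight batch by (N * prob_i) ** (-beta)

A full sum-tree PER implementation is around 100 lines. Libraries such as Tianshou and TorchRL bundle tested prioritized replay buffers. Stable-Baselines3's built-in DQN has historically been vanilla DQN without PER, so check its current documentation before relying on it.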
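The conceptual loop above can be made concrete with a small sum tree. The following is a teaching sketch in plain Python, not a production buffer: the class name `SumTree`, the helper `sample_batch`, the example TD errors, and the choice to normalise weights by the largest weight in the batch are all assumptions made for this example.

```python
import random


class SumTree:
    """Binary tree over `capacity` leaves; each internal node stores the sum of its children."""

    def __init__(self, capacity):
        self.capacity = capacity
        # nodes[1] is the root; leaves occupy indices capacity .. 2*capacity - 1.
        self.nodes = [0.0] * (2 * capacity)

    def update(self, index, priority):
        """Set one leaf and refresh the sums on the path to the root: O(log n)."""
        pos = index + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def find(self, mass):
        """Return the leaf whose cumulative-priority interval contains `mass`: O(log n)."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity


def sample_batch(tree, n_stored, batch_size, beta, rng):
    """Draw indices proportionally to priority; return them with max-normalised IS weights."""
    indices, weights = [], []
    for _ in range(batch_size):
        idx = tree.find(rng.random() * tree.total())
        prob = tree.nodes[idx + tree.capacity] / tree.total()
        indices.append(idx)
        weights.append((n_stored * prob) ** (-beta))
    max_w = max(weights)
    return indices, [w / max_w for w in weights]


alpha, eps = 0.6, 1e-3
td_errors = [0.1, 2.0, 0.5, 0.05, 1.0, 0.2, 0.0, 3.0]  # made-up values
tree = SumTree(capacity=len(td_errors))
for i, delta in enumerate(td_errors):
    tree.update(i, (abs(delta) + eps) ** alpha)

rng = random.Random(0)
indices, weights = sample_batch(tree, len(td_errors), batch_size=4, beta=0.4, rng=rng)
print(indices, [round(w, 3) for w in weights])
```

Transitions with large TD error are drawn more often and receive smaller weights, which is the trade the IS correction makes. The `eps` term keeps a transition with zero TD error from becoming unsampleable.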
Stacking the improvements
DQN baseline
+ experience replay
+ target network
+ Double DQN target
+ Dueling head (optional architecture)
+ PER (optional sampling)
= "Rainbow" when combined with n-step returns, distributional heads, and noisy nets

For CartPole, Double + Dueling alone often reaches a return of 500 (the maximum on CartPole-v1); PER tends to help more on sparse-reward Atari games.
Checkpoint: These are orthogonal knobs on the same replay-training loop. Ask whether your bottleneck is wrong values (try Double), representation (try Dueling), or sample efficiency (try PER / n-step).
Common mistakes
- Applying Double DQN but still using max on target net only — must use online argmax.
- Dueling without centering advantages — identifiability issues; always subtract mean advantage.
- PER α = 1.0 from step zero — overfits to noisy early TD errors; start α ≈ 0.6.
- Ignoring IS weights — prioritized sampling biases gradients without correction.
- Expecting PER to fix bad hyperparameters (it only changes which transitions are replayed and how they are weighted).
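The IS-weight mistake in the list above is easiest to see in the loss itself. The sketch below uses plain Python lists and a squared-error loss to stay self-contained; the function name `weighted_td_update` and the sample numbers are hypothetical. In PyTorch the same idea is usually written as an element-wise loss (for example `F.smooth_l1_loss(..., reduction="none")`) multiplied by the weights before averaging.

```python
def weighted_td_update(q_pred, targets, is_weights, eps=1e-3):
    """Return the IS-weighted mean squared TD loss and refreshed raw priorities."""
    td_errors = [t - q for q, t in zip(q_pred, targets)]
    # Each transition's squared error is scaled by its IS weight before averaging.
    loss = sum(w * d * d for w, d in zip(is_weights, td_errors)) / len(td_errors)
    # New priorities come from the fresh TD errors; the exponent alpha is
    # applied later, when they are written back into the buffer.
    new_priorities = [abs(d) + eps for d in td_errors]
    return loss, new_priorities


q_pred = [1.0, 2.0, 0.5]       # Q(s, a) for the sampled batch
targets = [1.5, 2.0, 2.5]      # bootstrapped targets
is_weights = [1.0, 0.8, 0.25]  # the high-error sample was over-sampled, so it gets a low weight

loss, new_priorities = weighted_td_update(q_pred, targets, is_weights)
unweighted = sum((t - q) ** 2 for q, t in zip(q_pred, targets)) / len(q_pred)
print(round(loss, 4), round(unweighted, 4))
```

Without the weights, the frequently sampled high-error transition dominates the loss twice over: once by being drawn more often and once through its large error.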