
Module 4 — Deep Q-networks

Double, dueling & prioritized replay

Overestimation bias, advantage streams, and PER sampling.

~65 min read + exercises

Before we begin

Vanilla DQN tends to overestimate action values: taking a max over noisy estimates favors whichever action has the largest positive error. Double DQN decouples action selection from evaluation. Dueling architectures separate the state value V(s) from the per-action advantages A(s,a). Prioritized replay samples high-error transitions more often. Together they form a common baseline stack for modern DQN.


Learning objectives

  • Write the Double DQN target using online net for argmax, target net for value.
  • Draw the dueling head: Q(s,a) = V(s) + A(s,a) − mean_a A(s,a).
  • Explain TD-error magnitude as a priority signal.
  • Compare uniform vs prioritized replay trade-offs.
  • Recognize these as patches on the same DQN loop, not new paradigms.

Double DQN target

Standard DQN uses the target network θ⁻ both to pick a′ and to evaluate it, so the noise that makes an action look best also inflates its value. Double DQN splits the two roles:

text
a* = argmax_a' Q(s', a'; θ)          # online network selects
y = r + γ Q(s', a*; θ⁻)              # target network evaluates
python
with torch.no_grad():
    online_next = policy_net(s2)
    best_actions = online_next.argmax(dim=1)
    target_next = target_net(s2)
    next_q = target_next.gather(1, best_actions.unsqueeze(1)).squeeze(1)
    target = r + gamma * (1.0 - d) * next_q
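
To run the snippet above end to end, here is a minimal sketch that wraps it in a function and calls it on random data. The function name `double_dqn_target`, the single-layer stand-in networks, and the batch shapes are illustrative assumptions, not part of the lesson's codebase.

```python
import torch
import torch.nn as nn

def double_dqn_target(policy_net, target_net, r, s2, d, gamma=0.99):
    """Return y = r + gamma * (1 - d) * Q_target(s2, argmax_a Q_online(s2, a))."""
    with torch.no_grad():
        best_actions = policy_net(s2).argmax(dim=1, keepdim=True)   # online net selects
        next_q = target_net(s2).gather(1, best_actions).squeeze(1)  # target net evaluates
        return r + gamma * (1.0 - d) * next_q

torch.manual_seed(0)
obs_dim, n_actions, batch = 4, 2, 8
policy_net = nn.Linear(obs_dim, n_actions)   # hypothetical stand-in for a real Q-network
target_net = nn.Linear(obs_dim, n_actions)   # separate weights, like a lagged copy
s2 = torch.randn(batch, obs_dim)
r = torch.ones(batch)
d = torch.zeros(batch)
d[0] = 1.0                                   # mark the first transition as terminal

y = double_dqn_target(policy_net, target_net, r, s2, d)

# Vanilla DQN target from the same target network, for comparison.
vanilla = r + 0.99 * (1.0 - d) * target_net(s2).max(dim=1).values.detach()
```

For the same target network, the Double DQN target can never exceed the vanilla one, because the value of any single action is at most the max over actions. The terminal transition reduces to `y = r` in both cases.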

Dueling architecture

Two streams share a trunk, then combine:

python
class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)
 
    def forward(self, x):
        h = self.trunk(x)
        v = self.value(h)
        adv = self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)

Intuition: many states have similar value regardless of action (pole nearly balanced). V(s) captures that; A(s,a) captures action-specific edge.
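
The mean subtraction in `forward` is what makes the V/A split well defined. The sketch below isolates that step with a hypothetical helper, `combine`, which mirrors the last line of `forward`; the tensor shapes and the shift constant `c` are arbitrary choices for illustration.

```python
import torch

def combine(v, adv, center=True):
    """Merge a (B, 1) value stream and a (B, A) advantage stream into Q-values."""
    if center:
        adv = adv - adv.mean(dim=1, keepdim=True)
    return v + adv

torch.manual_seed(0)
v = torch.randn(3, 1)      # one state value per batch row
adv = torch.randn(3, 4)    # one advantage per action
c = 5.0                    # arbitrary constant moved between the two streams

# Without centering, (v, adv) and (v - c, adv + c) produce the same Q-values,
# so the loss cannot distinguish the two decompositions.
q_plain_a = combine(v, adv, center=False)
q_plain_b = combine(v - c, adv + c, center=False)

# With centering, the mean of Q over actions equals V(s), which pins V down.
q_centered = combine(v, adv)
```

With centering, the value stream is forced to represent the average Q-value of the state, and the advantage stream only encodes how each action deviates from that average.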

Worked example — overestimation

True Q*(s,a) ≈ 10 for both actions. Noisy estimates:

| Action | Q online | Q target |
|---|---|---|
| 0 | 10.2 | 9.8 |
| 1 | 10.5 | 9.0 |

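If one network both selects and evaluates, as with the online column here, the max picks action 1 at 10.5 and overestimates the true value of about 10 by 0.5. Double DQN: the online net still picks action 1, but the target net evaluates it at 9.0. That single draw undershoots by 1.0, yet the estimate is no longer pushed upward on average, because the target net's error on action 1 is not the noise that made action 1 look best. The result is unbiased only if the two networks' errors are independent and symmetric.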
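
A single table row is one noisy draw, so the bias only shows up on average. The simulation below is a rough sketch under idealized assumptions: Gaussian noise with a made-up standard deviation of 0.5, and fully independent errors in the two estimates. In real training the target net is a lagged copy of the online net, so the errors are correlated and the correction is only partial.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = 10.0
n_actions, n_trials, noise_std = 2, 100_000, 0.5

# Two independent noisy estimates of the same true values.
q_online = true_q + noise_std * rng.standard_normal((n_trials, n_actions))
q_target = true_q + noise_std * rng.standard_normal((n_trials, n_actions))

# Single estimator: select and evaluate with the same noisy numbers.
single = q_online.max(axis=1)

# Double estimator: select with one set, evaluate with the other.
best = q_online.argmax(axis=1)
double = q_target[np.arange(n_trials), best]

single_bias = single.mean() - true_q   # positive: the max favors lucky errors
double_bias = double.mean() - true_q   # close to zero under these assumptions
```

For two actions with Gaussian noise of standard deviation σ, the expected single-estimator bias is σ/√π, about 0.28 here, while the double estimator's bias averages out to zero.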

Prioritized experience replay (PER)

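Each transition gets a priority p_i = |TD error| + ε and is sampled with probability proportional to p_i^α, where α ≈ 0.6 is a common choice and α = 0 recovers uniform sampling. Non-uniform sampling biases the gradient, so each sample is scaled by an importance-sampling weight w_i = (N · P(i))^(−β), with β annealed toward 1.0 over training to make the correction complete.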

| Component | Role |
|---|---|
| Priority p_i | Larger TD error → sampled more |
| Sum tree | O(log n) sampling by priority |
| IS weight w_i | Unbiases gradient from non-uniform sampling |

Conceptual sampling loop:

python
# priorities[i] = (abs(td_error) + eps) ** alpha
# prob_i = priorities[i] / sum(priorities)
# sample indices with prob_i; weight batch by (N * prob_i) ** (-beta)
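
The three commented lines above can be turned into a small runnable sketch with NumPy. The function name `sample_prioritized`, the default `beta=0.4`, and the normalization of weights by their batch maximum are assumptions made for illustration; this version samples in O(n) and skips the sum tree.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=None):
    """Sample indices with P(i) proportional to (|td_i| + eps)^alpha; return IS weights."""
    if rng is None:
        rng = np.random.default_rng()
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()   # assumed convention: weights only scale updates down
    return idx, weights, probs

rng = np.random.default_rng(0)
td = np.array([0.1, 0.1, 0.1, 2.0])   # one transition with a much larger TD error
idx, w, probs = sample_prioritized(td, batch_size=10_000, rng=rng)

# Empirical sampling frequency of each transition.
freq = np.bincount(idx, minlength=len(td)) / len(idx)
```

The high-error transition is drawn far more often than the uniform rate of 0.25, and it receives the smallest importance weight in return.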

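A full sum-tree PER implementation runs to roughly 100 lines. Prefer a tested prioritized buffer from an established library such as TorchRL or Tianshou over writing your own, and check that your library actually ships one: Stable-Baselines3's DQN documentation describes it as vanilla DQN without prioritized replay.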
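
To show the core idea behind the sum tree, here is a stripped-down sketch: leaves hold priorities, internal nodes hold the sum of their children, and sampling walks from the root to a leaf. The class `SumTree` and its method names are hypothetical, the capacity is assumed to be a power of two, and the replay storage itself is left out.

```python
import numpy as np

class SumTree:
    """Binary tree stored in an array; leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        # Assumes capacity is a power of two so the root-to-leaf walk is valid.
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity)   # index 1 is the root; leaves start at `capacity`

    def update(self, idx, priority):
        pos = idx + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:                      # refresh sums on the path to the root
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.tree[1]

    def find(self, mass):
        """Return the leaf index whose cumulative-priority interval contains `mass`."""
        pos = 1
        while pos < self.capacity:           # descend until a leaf is reached
            left = 2 * pos
            if mass < self.tree[left]:
                pos = left
            else:
                mass -= self.tree[left]
                pos = left + 1
        return pos - self.capacity

tree = SumTree(capacity=4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, p)

rng = np.random.default_rng(0)
draws = [tree.find(rng.uniform(0.0, tree.total())) for _ in range(10_000)]
freq = np.bincount(draws, minlength=4) / len(draws)
```

Both `update` and `find` touch one node per tree level, which is where the O(log n) cost comes from. The sampled frequencies come out proportional to the stored priorities.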

Stacking the improvements

text
DQN baseline
  + experience replay
  + target network
  + Double DQN target
  + Dueling head (optional architecture)
  + PER (optional sampling)
  = "Rainbow" when combined with n-step returns, distributional heads, and noisy nets

On CartPole-v1, Double + Dueling alone often reaches the 500-step return cap; PER tends to pay off more on sparse-reward Atari games.

Checkpoint: Ask whether your bottleneck is wrong values (try Double), representation (try Dueling), or sample efficiency (try PER / n-step). These are orthogonal knobs on the same replay-training loop.

Common mistakes

  1. Applying Double DQN but still taking the max over the target net alone; the argmax must come from the online net.
  2. Dueling without centering advantages — identifiability issues; always subtract mean advantage.
  3. PER α = 1.0 from step zero — overfits to noisy early TD errors; start α ≈ 0.6.
  4. Ignoring IS weights — prioritized sampling biases gradients without correction.
  5. Expecting PER to fix bad hyperparameters; it only reweights the data the agent already has.
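
Mistake 4 is easy to avoid once the loss is computed per sample. The sketch below is one way to apply importance-sampling weights and produce fresh priorities in the same step. The helper name `per_weighted_loss`, the choice of Huber loss, and the toy tensors are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def per_weighted_loss(q_pred, target, is_weights, eps=1e-6):
    """Huber loss weighted per sample, plus new priorities from |TD error|."""
    td_error = target - q_pred
    per_sample = F.smooth_l1_loss(q_pred, target, reduction="none")
    loss = (is_weights * per_sample).mean()
    new_priorities = td_error.detach().abs() + eps   # write these back to the buffer
    return loss, new_priorities

q_pred = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([1.5, 2.0, 0.0])
is_weights = torch.tensor([1.0, 1.0, 0.2])   # the heavily oversampled transition gets a small weight

loss, new_p = per_weighted_loss(q_pred, target, is_weights)
loss.backward()
```

The third transition has the largest TD error, so it will be sampled often, but its small weight keeps it from dominating the gradient.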
