Double, dueling & prioritized replay

Before we begin

Vanilla DQN tends to overestimate action values because taking a max over noisy estimates favors whichever action has the largest positive error. Double DQN decouples action selection from evaluation. Dueling architectures separate the state value V(s) from the advantages A(s,a). Prioritized replay samples transitions with large TD error more often. Together they form a common baseline stack for DQN.

Learning objectives

Write the Double DQN target using online net for argmax, target net for value.
Draw the dueling head: Q(s,a) = V(s) + A(s,a) − mean_a A(s,a).
Explain TD-error magnitude as a priority signal.
Compare uniform vs prioritized replay trade-offs.
Recognize these as patches on the same DQN loop, not new paradigms.

Double DQN target

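Standard DQN uses θ⁻ both to pick a′ and to evaluate it, so the same noise is counted twice: an action that looks best because of a positive error is then scored with that same error.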

text

a* = argmax_a' Q(s', a'; θ)          # online network selects
y = r + γ Q(s', a*; θ⁻)              # target network evaluates

python

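with torch.no_grad():
    # Online network selects the greedy next action (one index per transition).
    online_next = policy_net(s2)
    best_actions = online_next.argmax(dim=1)
    # Target network evaluates the selected action.
    target_next = target_net(s2)
    next_q = target_next.gather(1, best_actions.unsqueeze(1)).squeeze(1)
    # d is the done flag (1.0 at terminal states); it zeroes the bootstrap term.
    target = r + gamma * (1.0 - d) * next_q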
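To compare the two targets on concrete numbers, here is a minimal sketch that uses NumPy arrays as stand-ins for the network outputs. The function names `vanilla_target` and `double_dqn_target` and the toy Q-values are illustrative assumptions, not part of the lesson's training code.

```python
import numpy as np

def vanilla_target(r, d, q_target_next, gamma=0.99):
    """Standard DQN: the target net both selects and evaluates (max over actions)."""
    return r + gamma * (1.0 - d) * q_target_next.max(axis=1)

def double_dqn_target(r, d, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online net selects the action, the target net evaluates it."""
    best = q_online_next.argmax(axis=1)
    next_q = q_target_next[np.arange(len(best)), best]
    return r + gamma * (1.0 - d) * next_q

# Toy batch of 2 transitions with 2 actions each (made-up numbers).
r = np.array([1.0, 1.0])
d = np.array([0.0, 1.0])  # the second transition is terminal
q_online_next = np.array([[10.2, 10.5], [3.0, 2.0]])
q_target_next = np.array([[9.8, 9.0], [2.5, 4.0]])

y_vanilla = vanilla_target(r, d, q_target_next, gamma=0.9)
y_double = double_dqn_target(r, d, q_online_next, q_target_next, gamma=0.9)
```

For the first transition the two targets differ because the online net prefers action 1 while the target net's own max is action 0. For the terminal transition both targets reduce to r, since (1 - d) removes the bootstrap term.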

Dueling architecture

Two streams share a trunk, then combine:

python

import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # Shared feature trunk used by both streams.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)              # V(s): one scalar per state
        self.advantage = nn.Linear(hidden, n_actions)  # A(s,a): one value per action

    def forward(self, x):
        h = self.trunk(x)
        v = self.value(h)
        adv = self.advantage(h)
        # Subtract the mean advantage so V and A are identifiable.
        return v + adv - adv.mean(dim=1, keepdim=True)

Intuition: many states have similar value regardless of the action taken (for example, a nearly balanced pole). V(s) captures that shared value; A(s,a) captures the action-specific edge.
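The effect of mean-centering can be checked on a single state. The sketch below uses NumPy and made-up numbers; `dueling_q` is a hypothetical helper that mirrors the last line of `forward`, not a separate API.

```python
import numpy as np

def dueling_q(v, adv):
    """Combine V(s) and A(s,a) with mean-centered advantages."""
    return v + adv - adv.mean(axis=1, keepdims=True)

v = np.array([[5.0]])               # one state with V(s) = 5
adv = np.array([[1.0, -1.0, 3.0]])  # raw advantage outputs for 3 actions

q = dueling_q(v, adv)

# Shifting every advantage by a constant leaves Q unchanged after centering,
# so the advantage stream cannot absorb the state value.
q_shifted = dueling_q(v, adv + 7.0)
```

After centering, the mean of Q over actions equals V(s), which is what pins the two streams down.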

Worked example — overestimation

True Q*(s,a) ≈ 10 for both actions. Noisy estimates:

Action	Q online	Q target
0	10.2	9.8
1	10.5	9.0

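If one set of estimates both selects and evaluates (take the Q online column), the max picks action 1 and reports 10.5, overestimating the true value of 10 by 0.5. Double DQN lets the online net pick action 1 but reads its value from the target net: 9.0. That particular sample is 1.0 too low, but the target net's error is independent of which action the online net preferred, so the systematic upward bias of the max disappears on average when errors are symmetric.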
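The bias in this example can be checked with a short simulation. This is a sketch under simplifying assumptions: both actions have true value 10, and the online and target estimates carry independent Gaussian noise (real online and target networks are correlated, so the effect there is weaker). The function name `simulate` and its parameters are illustrative.

```python
import random

def simulate(n_trials=20000, n_actions=2, true_q=10.0, noise=1.0, seed=0):
    """Average the single-estimator max and the double estimator over noisy draws."""
    rng = random.Random(seed)
    single_sum = 0.0
    double_sum = 0.0
    for _ in range(n_trials):
        online = [true_q + rng.gauss(0.0, noise) for _ in range(n_actions)]
        target = [true_q + rng.gauss(0.0, noise) for _ in range(n_actions)]
        # Single estimator: the same noisy values select and evaluate.
        single_sum += max(online)
        # Double estimator: online selects, the independent target evaluates.
        a_star = max(range(n_actions), key=lambda a: online[a])
        double_sum += target[a_star]
    return single_sum / n_trials, double_sum / n_trials

single_mean, double_mean = simulate()
```

With two actions and unit noise, the expected maximum of two independent standard normals is 1/√π ≈ 0.56, so `single_mean` should land clearly above 10 while `double_mean` stays close to 10.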

Prioritized experience replay (PER)

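Priority ∝ |TD error|^α, with α ≈ 0.6 a common choice. Importance-sampling weights w_i = (N · P(i))^(−β), with β annealed toward 1.0 over training, correct the bias introduced by non-uniform sampling.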

Component	Role
Priority p_i	Larger TD error → sampled more
Sum tree	O(log n) sampling by priority
IS weight w_i	Unbiases gradient from non-uniform sampling

Conceptual sampling loop:

python

# priorities[i] = (abs(td_error) + eps) ** alpha
# prob_i = priorities[i] / sum(priorities)
# sample indices with prob_i; weight batch by (N * prob_i) ** (-beta)
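The comments above can be turned into a small runnable sketch with NumPy. It samples proportionally without a sum tree, so each call costs O(n), which is only reasonable for small buffers. The function name `sample_prioritized`, the default `beta=0.4`, and the per-batch normalization by the largest weight are assumptions made for illustration.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=None):
    """Proportional prioritized sampling with importance-sampling weights."""
    if rng is None:
        rng = np.random.default_rng()
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    # Normalize by the largest weight in the batch so weights only scale updates down.
    weights /= weights.max()
    return idx, weights, probs

# Four stored transitions; the third has a much larger TD error (made-up values).
td_errors = np.array([0.1, 0.1, 2.0, 0.1])
idx, weights, probs = sample_prioritized(
    td_errors, batch_size=1000, rng=np.random.default_rng(0)
)
```

The high-error transition is drawn most often and receives the smallest weight, which is how the correction offsets the oversampling.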

Full sum-tree PER is ~100 lines. Libraries such as TorchRL and Tianshou bundle tested prioritized replay buffers.
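For a sense of what the sum tree itself does, here is a minimal sketch in plain Python. It covers only priority storage and O(log n) lookup, not the replay buffer around it; the class name `SumTree` and its methods are hypothetical, and the capacity is assumed to be a power of two to keep the layout easy to picture.

```python
import random

class SumTree:
    """Binary tree in a flat list: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Index 1 is the root; leaves occupy indices capacity .. 2*capacity - 1.
        self.nodes = [0.0] * (2 * capacity)

    def update(self, i, priority):
        """Set leaf i to `priority` and refresh the sums on the path to the root."""
        pos = i + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def find(self, mass):
        """Return the leaf index whose cumulative priority range contains `mass`."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity

    def sample(self, rng=random):
        # rng.random() lies in [0, 1), so the drawn mass stays below the total.
        return self.find(rng.random() * self.total())

tree = SumTree(capacity=4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, p)

rng = random.Random(0)
draws = [tree.sample(rng) for _ in range(10000)]
frac_last = draws.count(3) / len(draws)  # leaf 3 holds 4/10 of the total priority
```

Both `update` and `find` touch one node per tree level, which is where the O(log n) cost in the table comes from.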

Stacking the improvements

text

DQN baseline
  + experience replay
  + target network
  + Double DQN target
  + Dueling head (optional architecture)
  + PER (optional sampling)
  = "Rainbow" when combined with n-step returns, distributional heads, and noisy nets

For CartPole-v1, Double + Dueling alone often reaches the maximum return of 500. PER helps more on sparse-reward Atari games.

Checkpoint: These are orthogonal knobs on the same replay-training loop. Ask whether your bottleneck is wrong values (try Double), representation (try Dueling), or sample efficiency (try PER / n-step).

Common mistakes

Applying Double DQN but still using max on target net only — must use online argmax.
Dueling without centering advantages — identifiability issues; always subtract mean advantage.
PER α = 1.0 from step zero — overfits to noisy early TD errors; start α ≈ 0.6.
Ignoring IS weights — prioritized sampling biases gradients without correction.
Expecting PER to fix bad hyperparameters: it only reweights which transitions are sampled.

What's next

Next lesson — DQN hyperparameters & debugging