DDPG & deterministic policies
Before we begin
Deep Deterministic Policy Gradient (DDPG) learns a deterministic actor μ_θ(s) and a critic Q_φ(s,a) off-policy from a replay buffer. It extends DQN ideas to continuous actions: instead of taking a max over actions, the actor is trained by differentiating through the critic, so the policy gradient becomes ∇_θ Q_φ(s, μ_θ(s)). DDPG was an early deep RL success on MuJoCo continuous-control tasks; TD3 later addressed its Q-value overestimation and training instability.
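DDPG: actor-critic method with a deterministic policy, target networks, and a replay buffer.
Deterministic policy gradient: ∇_θ J ≈ E[∇_a Q(s,a)|_{a=μ(s)} · ∇_θ μ(s)], that is, the critic's action gradient evaluated at a = μ(s), multiplied by the actor's parameter gradient.
Target networks: slow-moving copies of the actor and critic that give stable Q targets (an idea carried over from DQN).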
What you will learn
- Derive the deterministic policy gradient intuition.
- Implement DDPG components: actor, critic, targets, replay, exploration noise.
- Match DDPG hyperparameters to Gymnasium continuous tasks.
- Know why TD3 adds twin critics and delayed actor updates.
- Debug common failure modes: Q explosion, critic collapse.
Architecture
| Network | Input | Output | Update |
|---|---|---|---|
| Actor μ_θ(s) | state | action (tanh-scaled) | Ascend Q w.r.t. actions |
| Critic Q_φ(s,a) | state, action | scalar Q | TD error vs target |
| Target actor μ_θ′ | state | action | Polyak τ soft update |
| Target critic Q_φ′ | state, action | scalar Q | Polyak τ soft update |
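
    import torch
    import torch.nn.functional as F

    # Critic update (sketch); s, a, r, s_next, done are minibatch tensors from replay
    with torch.no_grad():  # targets are constants, so no gradient reaches the target nets
        a_next = target_actor(s_next)
        # TD3 only (target policy smoothing): a_next = a_next + clipped_noise
        y = r + gamma * (1 - done) * target_critic(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), y)

    # Actor update: maximize Q(s, mu(s)) by minimizing its negative
    actor_loss = -critic(s, actor(s)).mean()

Exploration: add Ornstein–Uhlenbeck or Gaussian noise to μ(s) only when acting in the environment. Critic targets use the noise-free target actor, unless you apply TD3-style target smoothing.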
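
As a rough illustration of the exploration step, here is a minimal Ornstein–Uhlenbeck noise sketch in plain Python for a one-dimensional action. The class name `OUNoise`, the helper `explore`, and the step size `dt=1e-2` are assumptions made for this example; `theta=0.15` and `sigma=0.2` are commonly used values, not requirements.

```python
import math
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        # Call at the start of each episode so noise does not carry over.
        self.x = self.mu

    def sample(self):
        # Euler-Maruyama step: pull toward mu, then add scaled Gaussian noise.
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def explore(mu_s, noise, low=-1.0, high=1.0):
    """Behaviour action: deterministic action plus noise, clipped to the bounds."""
    return max(low, min(high, mu_s + noise.sample()))


noise = OUNoise(seed=0)
# Pretend the actor always outputs 0.9; only the noise varies here.
actions = [explore(0.9, noise) for _ in range(1000)]
```

Because successive samples are correlated, the perturbation drifts smoothly instead of jittering around μ(s). Independent Gaussian noise is a simpler alternative that often works as well.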
Deterministic policy gradient intuition
Stochastic policies require an expectation over actions: ∇_θ J = E[∇_θ log π_θ(a|s) Q(s,a)]. If the policy is deterministic, a = μ_θ(s), the expectation over actions disappears and the gradient flows directly through Q to θ by the chain rule:
∇_θ J ≈ E_s[ ∇_a Q_φ(s,a)|_{a=μ_θ(s)} · ∇_θ μ_θ(s) ]
The actor nudges its actions in the direction the critic says increases value, so critic quality caps actor improvement. The deadly triad (function approximation, bootstrapping, and off-policy learning) still applies, so divergence remains a risk.
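
The chain rule above can be checked by hand on a toy problem. The sketch below assumes a hypothetical one-parameter actor μ_θ(s) = θ·s and a known toy critic Q(s,a) = −(a − 2s)², so the best policy is θ = 2. None of these functions come from a library; they exist only to make the gradient concrete.

```python
def actor(theta, s):
    return theta * s                 # mu_theta(s), a linear toy policy


def critic(s, a):
    return -(a - 2.0 * s) ** 2       # toy Q(s, a), maximised at a = 2s


def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)      # gradient of Q with respect to the action


def dmu_dtheta(theta, s):
    return s                         # gradient of the actor output w.r.t. theta


def dpg_gradient(theta, states):
    """Mean over states of dQ/da (at a = mu(s)) times dmu/dtheta."""
    total = 0.0
    for s in states:
        a = actor(theta, s)
        total += dq_da(s, a) * dmu_dtheta(theta, s)
    return total / len(states)


def objective(theta, states):
    """J(theta): mean of Q(s, mu_theta(s)) over the sampled states."""
    return sum(critic(s, actor(theta, s)) for s in states) / len(states)


states = [-1.0, -0.5, 0.5, 1.0]
theta = 0.0

# Compare the chain-rule gradient with a central finite difference of J.
analytic = dpg_gradient(theta, states)
eps = 1e-5
numeric = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)

# Gradient ascent on theta using only the critic's action gradient.
for _ in range(200):
    theta += 0.1 * dpg_gradient(theta, states)
```

The two gradient estimates agree, and ascent drives θ toward 2. In DDPG the critic is learned rather than known, which is why a poor critic produces poor actor updates.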
Worked example: target computation
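Take a transition (s, a = 0.5, r = 1, s′, done = 0) with γ = 0.99. The target actor gives a′ = μ′(s′) = 0.2, and the target critic gives Q′(s′, a′) = 10.
y = r + γ(1 − done) Q′(s′, a′) = 1 + 0.99 × 1 × 10 = 10.9
If the critic predicted Q(s,a) = 5, the TD error is 10.9 − 5 = 5.9, which drives a large update. Target networks make Q′ and μ′ change slowly, so y does not chase a moving actor and critic at every step.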
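
The same arithmetic as a small function, mainly to show how the `done` flag switches off bootstrapping. The name `td_target` is an assumption for this sketch.

```python
def td_target(r, done, q_next, gamma=0.99):
    """Bellman target: reward plus discounted target-critic value unless terminal."""
    return r + gamma * (1.0 - done) * q_next


# Numbers from the worked example: r = 1, done = 0, Q'(s', a') = 10.
y = td_target(r=1.0, done=0.0, q_next=10.0)
td_error = y - 5.0                      # critic predicted Q(s, a) = 5

# On a terminal transition the target collapses to the reward alone.
y_terminal = td_target(r=1.0, done=1.0, q_next=10.0)

print(round(y, 2), round(td_error, 2), y_terminal)
```

Forgetting the `(1 - done)` factor is an easy bug to introduce: the critic then bootstraps from states that follow the end of an episode.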
Checkpoint: Why is a deterministic actor dangerous without exploration noise?
Answer
The replay buffer only contains actions near what the actor already does. Without noise, the critic is never trained on diverse actions → extrapolation error when the actor tries something new. Noise fills the buffer with varied (s,a) pairs.
DDPG training loop
- Observe s, select a = clip(μ(s) + noise, a_low, a_high).
- Store (s, a, r, s′, done) in replay.
- Sample minibatch; update critic toward Bellman target.
- Update actor to maximize Q(s, μ(s)).
- Soft-update both target networks: θ′ ← τθ + (1−τ)θ′ and φ′ ← τφ + (1−τ)φ′.
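
Two bookkeeping pieces of the loop above, the replay buffer and the Polyak update, can be sketched without any deep learning library. `ReplayBuffer` and `soft_update` are hypothetical names for this example, and weights are plain lists of floats standing in for network parameters.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of transitions with uniform sampling."""

    def __init__(self, capacity, seed=None):
        self.data = deque(maxlen=capacity)   # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


def soft_update(target, online, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(s=t, a=0.1 * t, r=1.0, s_next=t + 1, done=0.0)
batch = buffer.sample(2)                 # only the 3 newest transitions remain

target = soft_update(target=[0.0, 0.0], online=[1.0, -2.0], tau=0.005)
```

With τ = 0.005 the target moves only 0.5% of the way toward the online weights per update, which is what keeps the Bellman target slow-moving.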
| Hyperparameter | Typical range | Notes |
|---|---|---|
| τ (Polyak) | 0.001 – 0.01 | Smaller = stabler targets |
| lr_actor / lr_critic | 1e-4 – 1e-3 | Critic lr is usually equal to or higher than actor lr (the original paper used 1e-4 / 1e-3) |
| buffer size | 1e5 – 1e6 | Continuous tasks need diversity |
| batch size | 64 – 256 | |
| γ | 0.99 | Match task horizon |
| OU noise σ | 0.1 – 0.3 | Tune per env |
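
One way to keep these settings together is a small validated config object. The sketch below is an assumption about how you might organise it: `DDPGConfig`, `target_half_life`, and `effective_horizon` are illustrative names, and the defaults are simply values picked from inside the ranges in the table.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DDPGConfig:
    tau: float = 0.005
    lr_actor: float = 1e-4
    lr_critic: float = 3e-4
    buffer_size: int = 1_000_000
    batch_size: int = 256
    gamma: float = 0.99
    ou_sigma: float = 0.2

    def __post_init__(self):
        # Fail early on values that cannot work.
        if not 0.0 < self.tau < 1.0:
            raise ValueError("tau must be in (0, 1)")
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must be in [0, 1]")
        if self.batch_size > self.buffer_size:
            raise ValueError("batch_size cannot exceed buffer_size")


def target_half_life(tau):
    """Updates needed for a target weight to close half the gap to a fixed online weight."""
    return math.log(0.5) / math.log(1.0 - tau)


def effective_horizon(gamma):
    """Rough number of steps over which rewards matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


cfg = DDPGConfig()
print(round(target_half_life(cfg.tau)))      # roughly 138 updates for tau = 0.005
print(round(effective_horizon(cfg.gamma)))   # roughly 100 steps for gamma = 0.99
```

The half-life gives a feel for what "smaller = stabler" means: at τ = 0.001 the target takes several hundred updates to move halfway, at τ = 0.01 fewer than a hundred.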
TD3 improvements (know the names)
| Issue in DDPG | TD3 fix |
|---|---|
| Q overestimation | Twin critics, take min(Q1, Q2) |
| High-variance targets | Target policy smoothing — noise on a′ |
| Moving target | Delayed policy updates — actor every 2 critic steps |
When debugging DDPG, try TD3 in Stable-Baselines3 (TD3 class) before heavy manual tuning.
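
To make the three fixes concrete, here is a minimal scalar sketch of the TD3 target and the delayed actor schedule. The function names and the toy critics `q1` and `q2` are hypothetical. The defaults `noise_std=0.2`, `noise_clip=0.5`, and `policy_delay=2` are the values commonly associated with TD3; treat them as assumptions to tune.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def td3_target(r, done, a_next, q1_next, q2_next, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0):
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    eps = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    a_smooth = clip(a_next + eps, a_low, a_high)
    # Clipped double-Q: bootstrap from the more pessimistic target critic.
    q_min = min(q1_next(a_smooth), q2_next(a_smooth))
    return r + gamma * (1.0 - done) * q_min


def should_update_actor(critic_step, policy_delay=2):
    # Delayed policy updates: one actor step per `policy_delay` critic steps.
    return critic_step % policy_delay == 0


# Toy target critics for a fixed next state; q1 is more optimistic than q2.
q1 = lambda a: 10.0 - a ** 2
q2 = lambda a: 9.0 - a ** 2

rng = random.Random(0)
y_td3 = td3_target(r=1.0, done=0.0, a_next=0.2, q1_next=q1, q2_next=q2, rng=rng)
y_ddpg = 1.0 + 0.99 * q1(0.2)            # single-critic DDPG target for comparison
schedule = [should_update_actor(step) for step in range(1, 5)]
```

Here the TD3 target is lower than the single-critic DDPG target because it bootstraps from the smaller of the two estimates, which is the mechanism that counteracts overestimation.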
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| No action noise | Flat learning, brittle Q | OU / Gaussian noise |
| τ too large | Oscillating Q | Reduce to 0.005 |
| Actor faster than critic | Nonsense gradients | Delay actor (TD3) |
| Forgot to clip actions | Env NaNs | tanh + scale |
| Batch norm in Q on replay | Non-stationary stats | Layer norm or no BN |
| Evaluating with noise | Lower, noisier evaluation returns | Use noise-free μ(s) at test time |
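
The "tanh + scale" fix from the table can be sketched as follows for a scalar action. The helper names are assumptions, and the bounds [-2, 2] are used only as an illustration (they match the torque limits of Gymnasium's Pendulum-v1).

```python
import math


def scale_action(raw, low, high):
    """Squash an unbounded actor output into [low, high] with tanh."""
    squashed = math.tanh(raw)                          # in [-1, 1]
    return low + 0.5 * (squashed + 1.0) * (high - low)


def behaviour_action(raw, noise, low, high):
    """Exploration action: scale first, add noise, then clip back into bounds."""
    a = scale_action(raw, low, high) + noise
    return max(low, min(high, a))


LOW, HIGH = -2.0, 2.0
centre = scale_action(0.0, LOW, HIGH)                  # midpoint of the range
saturated = scale_action(1e6, LOW, HIGH)               # huge output stays in bounds
noisy = behaviour_action(3.0, 0.5, LOW, HIGH)          # noise pushes past HIGH, clip restores it
```

The tanh keeps the actor's own output in range, while the final clip is still needed because exploration noise is added after scaling.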
Closing
DDPG showed deep off-policy control with deterministic actors and replay. Practice today often starts with TD3 or SAC for stability, but the DDPG pattern — critic TD + actor ascending Q — is the foundation. Next lesson: SAC adds entropy for robust exploration and automatic temperature tuning.