Module 8 — Continuous control & robotics

DDPG & deterministic policies

Actor–critic for continuous control, target networks, and exploration noise.

~65 min read + exercises

Before we begin

Deep Deterministic Policy Gradient (DDPG) learns a deterministic actor μ_θ(s) and a Q-critic Q_φ(s,a) off-policy from a replay buffer. It extends DQN ideas to continuous action spaces by letting the actor differentiate through the critic: the actor is updated along ∇_θ Q_φ(s, μ_θ(s)). DDPG was a breakthrough for MuJoCo locomotion; TD3 later addressed its Q overestimation and training instability.

DDPG: actor–critic with a deterministic policy, target networks, and a replay buffer.
Deterministic policy gradient: ∇_θ J ≈ E[∇_θ μ_θ(s) · ∇_a Q(s,a)|_{a=μ_θ(s)}].
Target networks: slow-moving copies of the actor and critic that keep Q targets stable (an idea from DQN).


What you will learn

  • Derive the deterministic policy gradient intuition.
  • Implement DDPG components: actor, critic, targets, replay, exploration noise.
  • Match DDPG hyperparameters to Gymnasium continuous tasks.
  • Know why TD3 adds twin critics and delayed actor updates.
  • Debug common failure modes: Q explosion, critic collapse.

Architecture

| Network | Input | Output | Update |
|---|---|---|---|
| Actor μ_θ(s) | state | action (tanh-scaled) | Ascend Q w.r.t. actions |
| Critic Q_φ(s,a) | state, action | scalar Q | TD error vs target |
| Target actor μ_θ′ | state | action | Polyak τ soft update |
| Target critic Q_φ′ | state, action | scalar Q | Polyak τ soft update |

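
The four networks in the table could be realized in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the class names `Actor` and `Critic`, the hidden width of 256, the `max_action` argument, and the toy dimensions are all chosen for illustration.

```python
import copy

import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy: state -> action in [-max_action, max_action]."""

    def __init__(self, state_dim, action_dim, max_action=1.0, hidden=256):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        # tanh output lies in [-1, 1]; rescale it to the action bounds.
        return self.max_action * self.net(s)


class Critic(nn.Module):
    """Q-function: (state, action) -> scalar value."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


# Assumed toy sizes: 3-dimensional state, 1-dimensional action bounded by 2.0.
actor, critic = Actor(3, 1, max_action=2.0), Critic(3, 1)
# Targets start as exact copies and are afterwards moved only by soft updates.
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)

s = torch.randn(4, 3)
print(actor(s).shape, critic(s, actor(s)).shape)
```

Feeding the action in at the first critic layer is one common choice; some implementations inject it at a later layer instead.
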
python
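# Critic and actor updates (sketch). Assumes actor, critic, target_actor,
# target_critic, gamma and a sampled batch (s, a, r, s_next, done) exist.
# r and done should have shape (batch, 1) to match the critic output.
import torch
import torch.nn.functional as F

# Critic update: regress Q(s, a) onto the Bellman target
with torch.no_grad():
    a_next = target_actor(s_next)
    # TD3 only (not DDPG): a_next = a_next + clipped_noise
    y = r + gamma * (1 - done) * target_critic(s_next, a_next)

critic_loss = F.mse_loss(critic(s, a), y)

# Actor update: maximize Q(s, mu(s)) by minimizing its negative
actor_loss = -critic(s, actor(s)).mean()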

Exploration: add Ornstein–Uhlenbeck or Gaussian noise to μ(s) only when acting in the environment. Do not add it when computing critic targets; TD3-style target policy smoothing is a separate, deliberate exception.
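
A minimal sketch of Ornstein–Uhlenbeck exploration noise, using only the standard library. The class name `OUNoise`, the helper `explore`, and the defaults `theta=0.15`, `sigma=0.2`, `dt=1.0` are assumptions for illustration; treat the values as starting points to tune per environment.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.dim, self.mu, self.theta, self.sigma, self.dt = dim, mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Call at the start of every episode.
        self.x = [self.mu] * self.dim

    def sample(self):
        # x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.x = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * self.dt ** 0.5 * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)


def explore(mu_action, noise, low, high):
    """Add noise to the actor output, then clip to the action bounds."""
    return [min(max(a + n, low), high) for a, n in zip(mu_action, noise)]


ou = OUNoise(dim=2, seed=0)
mu_action = [0.9, -0.4]  # stand-in for actor(s)
a_env = explore(mu_action, ou.sample(), low=-1.0, high=1.0)
print(a_env)
```

Independent Gaussian noise is the simpler alternative: replace `ou.sample()` with one fresh `random.gauss(0.0, sigma)` draw per action dimension.
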


Deterministic policy gradient intuition

Stochastic policies average over actions: ∇_θ J = E[∇_θ log π_θ(a|s) Q(s,a)]. If the policy is deterministic, a = μ_θ(s), the expectation over actions disappears and we can backpropagate through Q directly into θ:

∇_θ J ≈ E[∇_θ μ_θ(s) · ∇_a Q(s,a)|_{a=μ_θ(s)}]

The actor nudges actions in the direction the critic says increases value. Critic quality therefore caps actor improvement, and the deadly triad (function approximation, bootstrapping, off-policy learning) still applies.
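
To make the chain rule concrete, here is a toy check in plain Python. Everything in it is an assumption chosen so the gradients can be written by hand: a scalar linear actor μ_θ(s) = θ·s and a known quadratic critic Q(s,a) = −(a − 2s)², which prefers a = 2s. The sketch compares the chain-rule gradient with a finite-difference estimate, then runs gradient ascent on θ.

```python
def mu(theta, s):
    """Toy linear actor."""
    return theta * s


def q(s, a):
    """Toy critic, maximized at a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)


def dmu_dtheta(theta, s):
    return s


def dpg(theta, states):
    """Mean over states of dQ/da (at a = mu(s)) times dmu/dtheta."""
    return sum(dq_da(s, mu(theta, s)) * dmu_dtheta(theta, s) for s in states) / len(states)


def objective(theta, states):
    return sum(q(s, mu(theta, s)) for s in states) / len(states)


states = [-1.0, 0.5, 2.0]
theta, eps = 0.0, 1e-6

# Central finite difference of J(theta) as an independent check.
numeric = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)
print(dpg(theta, states), numeric)

# Gradient ascent on theta moves the actor toward the critic's preferred action.
for _ in range(200):
    theta += 0.1 * dpg(theta, states)
print(theta)
```

With neural networks, autograd computes the same product when the loss −Q(s, μ(s)) is backpropagated; here the critic is exact, so θ approaches 2.
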


Worked example: target computation

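Consider a transition (s, a = 0.5, r = 1, s′, done = 0) with γ = 0.99. The target actor gives a′ = μ′(s′) = 0.2, and the target critic gives Q′(s′, a′) = 10.

y = r + γ(1 − done)·Q′(s′, a′) = 1 + 0.99 × 10 = 10.9

If the critic predicted Q(s,a) = 5, the TD error is y − Q(s,a) = 5.9, a large update. Target networks make Q′ and μ′ change slowly, so y does not chase a moving actor and critic at every step.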
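
The same arithmetic as a small Python sketch; the function name `td_target` is hypothetical, and the numbers are the ones from the example.

```python
def td_target(r, done, q_next, gamma=0.99):
    """Bellman target y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    return r + gamma * (1.0 - done) * q_next


y = td_target(r=1.0, done=0.0, q_next=10.0)
td_error = y - 5.0  # the critic predicted Q(s, a) = 5
print(round(y, 6), round(td_error, 6))

# On a terminal transition the bootstrap term is dropped entirely.
print(td_target(r=1.0, done=1.0, q_next=10.0))
```
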

Checkpoint: Why is a deterministic actor dangerous without exploration noise?

Answer

The replay buffer only contains actions near what the actor already does. Without noise, the critic is never trained on diverse actions → extrapolation error when the actor tries something new. Noise fills the buffer with varied (s,a) pairs.


DDPG training loop

  1. Observe s, select a = clip(μ(s) + noise).
  2. Store (s, a, r, s′, done) in replay.
  3. Sample minibatch; update critic toward Bellman target.
  4. Update actor to maximize Q(s, μ(s)).
  5. Soft-update both target networks: θ′ ← τθ + (1−τ)θ′ and φ′ ← τφ + (1−τ)φ′.
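
Steps 2, 3 and 5 rely on two small utilities, sketched below with the standard library only. The names `ReplayBuffer` and `soft_update` are hypothetical, and plain Python lists stand in for network weights so the Polyak rule is easy to inspect.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions drop out first

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        batch = rng.sample(list(self.data), batch_size)
        # Transpose a list of transitions into per-field tuples.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.data)


def soft_update(target, online, tau):
    """Polyak averaging, elementwise: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(s=t, a=0.1 * t, r=1.0, s_next=t + 1, done=0.0)
states, actions, rewards, next_states, dones = buffer.sample(2)

target_weights = soft_update(target=[0.0, 0.0], online=[1.0, -2.0], tau=0.005)
print(len(buffer), states, target_weights)
```

With PyTorch modules, the same rule is typically applied in place to each pair of online and target parameters under `torch.no_grad()`.
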

| Hyperparameter | Typical range | Notes |
|---|---|---|
| τ (Polyak) | 0.001 – 0.01 | Smaller = stabler targets |
| lr_actor / lr_critic | 1e-4 – 3e-4 | Critic often same or faster |
| buffer size | 1e5 – 1e6 | Continuous tasks need diversity |
| batch size | 64 – 256 | |
| γ | 0.99 | Match task horizon |
| OU noise σ | 0.1 – 0.3 | Tune per env |

TD3 improvements (know the names)

| Issue in DDPG | TD3 fix |
|---|---|
| Q overestimation | Twin critics, take min(Q1, Q2) |
| High-variance targets | Target policy smoothing: noise on a′ |
| Moving target | Delayed policy updates: actor every 2 critic steps |
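
The first two fixes both change how the target y is computed. Below is a minimal single-transition sketch in plain Python; the function name `td3_target`, the callables standing in for target networks, and the defaults `policy_noise=0.2` and `noise_clip=0.5` are assumptions for illustration.

```python
import random


def clip(x, low, high):
    return min(max(x, low), high)


def td3_target(r, done, s_next, target_actor, target_q1, target_q2,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5,
               low=-1.0, high=1.0, rng=random):
    """TD3-style target with policy smoothing and clipped double Q."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    eps = clip(rng.gauss(0.0, policy_noise), -noise_clip, noise_clip)
    a_next = clip(target_actor(s_next) + eps, low, high)
    # Clipped double Q: bootstrap from the more pessimistic critic.
    q_next = min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next


# Toy stand-ins: the second critic is systematically 2.0 higher.
mu_target = lambda s: 0.2
q1 = lambda s, a: 10.0
q2 = lambda s, a: 12.0

y = td3_target(1.0, 0.0, s_next=None, target_actor=mu_target,
               target_q1=q1, target_q2=q2)
print(y)
```

The third fix lives in the training loop rather than the target: update the actor and the target networks only every second critic step, for example behind a check such as `step % 2 == 0`.
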

When debugging DDPG, try TD3 in Stable-Baselines3 (TD3 class) before heavy manual tuning.


Common mistakes

| Mistake | Symptom | Fix |
|---|---|---|
| No action noise | Flat learning, brittle Q | OU / Gaussian noise |
| τ too large | Oscillating Q | Reduce to 0.005 |
| Actor faster than critic | Nonsense gradients | Delay actor (TD3) |
| Forgot to clip actions | Env NaNs | tanh + scale |
| Batch norm in Q on replay | Non-stationary stats | Layer norm or no BN |
| Evaluating with noise | Suboptimal deploy | μ(s) only at test |
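
Two of these fixes (tanh plus scaling, and noise-free evaluation) fit in one small helper. The sketch below is plain Python with hypothetical names `scale_action` and `select_action`; the bounds of ±2.0 and the noise scale are assumed for illustration.

```python
import math
import random


def scale_action(raw, low, high):
    """Squash an unbounded actor output with tanh, then map [-1, 1] to [low, high]."""
    return low + 0.5 * (math.tanh(raw) + 1.0) * (high - low)


def select_action(raw, low, high, sigma=0.1, evaluate=False, rng=random):
    """Return mu(s) at evaluation time; add clipped Gaussian noise when training."""
    a = scale_action(raw, low, high)
    if evaluate:
        return a
    noisy = a + rng.gauss(0.0, sigma * (high - low))
    return min(max(noisy, low), high)


low, high = -2.0, 2.0  # e.g. a torque limit; assumed for illustration
print(select_action(50.0, low, high, evaluate=True))
print(select_action(50.0, low, high, evaluate=False))
```

Even a very large raw output stays inside the bounds, and the `evaluate=True` path is deterministic, so repeated calls on the same input agree.
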

Closing

DDPG showed deep off-policy control with deterministic actors and replay. Practice today often starts with TD3 or SAC for stability, but the DDPG pattern — critic TD + actor ascending Q — is the foundation. Next lesson: SAC adds entropy for robust exploration and automatic temperature tuning.

