DDPG & deterministic policies

Before we begin

Deep Deterministic Policy Gradient (DDPG) learns a deterministic actor μ_θ(s) and a Q-critic Q_φ(s,a) off-policy from a replay buffer. It extends DQN ideas to continuous action spaces by backpropagating through the critic into the actor: the policy gradient becomes ∇_θ Q_φ(s, μ_θ(s)). DDPG was a breakthrough for MuJoCo locomotion; TD3 later addressed its Q-value overestimation and training instability.

DDPG: an actor–critic method with a deterministic policy, target networks, and a replay buffer.
Deterministic policy gradient: ∇_θ J ≈ E[∇_a Q(s,a) at a = μ(s) · ∇_θ μ(s)].
Target networks: slow-moving copies of the actor and critic that keep Q targets stable (an idea carried over from DQN).

What you will learn

Derive the deterministic policy gradient intuition.
Implement DDPG components: actor, critic, targets, replay, exploration noise.
Match DDPG hyperparameters to Gymnasium continuous tasks.
Know why TD3 adds twin critics and delayed actor updates.
Debug common failure modes: Q explosion, critic collapse.

Architecture

Network	Input	Output	Update
Actor μ_θ(s)	state	action (tanh-scaled)	Ascend Q w.r.t. actions
Critic Q_φ(s,a)	state, action	scalar Q	TD error vs target
Target actor μ_θ′	state	action	Polyak τ soft update
Target critic Q_φ′	state, action	scalar Q	Polyak τ soft update

python

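# Sketch only: assumes actor, critic, target_actor, target_critic, gamma,
# and a sampled batch (s, a, r, s_next, done) are already defined.
import torch
import torch.nn.functional as F

# Critic update: regress Q(s, a) toward the Bellman target
with torch.no_grad():
    a_next = target_actor(s_next)
    # TD3 only (not part of DDPG): a_next = a_next + clipped_noise
    y = r + gamma * (1 - done) * target_critic(s_next, a_next)

critic_loss = F.mse_loss(critic(s, a), y)

# Actor update: maximize Q(s, mu(s)) by minimizing its negative
actor_loss = -critic(s, actor(s)).mean()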

Exploration: add Ornstein–Uhlenbeck or Gaussian noise to μ(s) when interacting with the environment, not when computing critic targets (TD3-style target smoothing is the one exception).
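As a rough illustration of the exploration step, the sketch below implements an Ornstein–Uhlenbeck process in plain Python and adds it to an actor output before clipping. The class name `OUNoise`, the helper `explore`, the action bounds of [-1, 1], and the defaults θ=0.15, σ=0.2 are assumptions chosen for the example, not values prescribed by this lesson.

```python
import math
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that drifts back to mu."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.dim, self.mu, self.theta, self.sigma, self.dt = dim, mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart at the long-run mean, typically at the start of each episode.
        self.state = [self.mu] * self.dim

    def sample(self):
        # x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def explore(mu_s, noise, low=-1.0, high=1.0):
    """Behaviour action: actor output plus noise, clipped to the action bounds."""
    return [min(high, max(low, a + n)) for a, n in zip(mu_s, noise.sample())]


noise = OUNoise(dim=2, seed=0)
action = explore([0.9, -0.3], noise)  # [0.9, -0.3] stands in for mu(s)
```

Independent Gaussian noise per step is a simpler alternative; the OU process only differs in that consecutive samples are correlated.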

Deterministic policy gradient intuition

Stochastic policy gradients take an expectation over actions: ∇_θ J = E[∇_θ log π(a|s) Q(s,a)]. If the policy is deterministic, a = μ(s), the expectation over actions disappears and the gradient flows by the chain rule through Q into θ:

∇_θ J ≈ E[∇_a Q(s,a) at a = μ(s) · ∇_θ μ(s)]

The actor nudges actions in the direction the critic says increases value, so critic quality caps actor improvement. The deadly triad (function approximation, bootstrapping, and off-policy data) still applies, so divergence remains a risk.
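To make the chain rule concrete, here is a minimal toy sketch with a one-parameter linear actor and a hand-written quadratic critic. Both functions are invented for illustration (a real critic is learned, not known in closed form); the point is only that ∇_a Q · ∇_θ μ matches a finite-difference gradient of the objective and that ascending it moves the actor toward the action the critic prefers.

```python
def mu(theta, s):
    # Toy linear actor: a = theta * s
    return theta * s


def q(s, a):
    # Toy critic whose best action is a = 2 * s
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    # Derivative of the toy critic with respect to the action
    return -2.0 * (a - 2.0 * s)


def dmu_dtheta(theta, s):
    # Derivative of the toy actor with respect to its parameter
    return s


def dpg_grad(theta, states):
    # Mean over states of dQ/da evaluated at a = mu(s), times dmu/dtheta
    return sum(dq_da(s, mu(theta, s)) * dmu_dtheta(theta, s) for s in states) / len(states)


def objective(theta, states):
    # J(theta) = mean over states of Q(s, mu(s))
    return sum(q(s, mu(theta, s)) for s in states) / len(states)


states = [-1.0, 0.5, 2.0]
theta = 0.0

# Compare the chain-rule gradient with a central finite difference.
eps = 1e-6
finite_diff = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)
analytic = dpg_grad(theta, states)

# Gradient ascent on theta: the actor converges to the critic's preferred slope.
for _ in range(200):
    theta += 0.1 * dpg_grad(theta, states)
```

In an autodiff framework the two derivative functions are not written by hand; calling backward on the negated critic output of the actor's action performs the same product.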

Worked example: target computation

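Take a transition (s, a=0.5, r=1, s′, done=0) with γ=0.99. The target actor gives a′ = μ′(s′) = 0.2, and the target critic gives Q′(s′, a′) = 10.

y = r + γ(1 − done) Q′(s′, a′) = 1 + 0.99 × 1 × 10 = 10.9

If the critic predicted Q(s,a) = 5, the TD error is y − Q(s,a) = 5.9, which drives a large update. Target networks make Q′ and μ′ change slowly, so y does not chase a moving actor and critic at every step.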
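The arithmetic in this worked example can be checked with a few lines of Python. The function name `td_target` is an assumption made for the sketch; the numbers are the ones from the example, plus one extra terminal case to show how `done` removes the bootstrap term.

```python
def td_target(r, done, q_next, gamma=0.99):
    # Bellman target: bootstrap from the target critic unless the episode ended.
    return r + gamma * (1.0 - done) * q_next


y = td_target(r=1.0, done=0.0, q_next=10.0)           # about 10.9
td_error = y - 5.0                                     # critic predicted 5, so about 5.9
y_terminal = td_target(r=1.0, done=1.0, q_next=10.0)   # terminal step: target is just r
```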

Checkpoint: Why is a deterministic actor dangerous without exploration noise?

Answer

The replay buffer only contains actions near what the actor already does. Without noise, the critic is never trained on diverse actions, so it extrapolates poorly when the actor tries something new. Noise fills the buffer with varied (s,a) pairs.

DDPG training loop

Observe s, select a = clip(μ(s) + noise).
Store (s, a, r, s′, done) in replay.
Sample minibatch; update critic toward Bellman target.
Update actor to maximize Q(s, μ(s)).
Soft-update both target networks: θ′ ← τθ + (1−τ)θ′ and φ′ ← τφ + (1−τ)φ′.
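Two pieces of the loop above are framework-independent and can be sketched in plain Python: the replay buffer and the Polyak soft update. The names `ReplayBuffer` and `soft_update` are illustrative, and parameters are represented as flat lists of floats purely to keep the example self-contained; in practice they would be network tensors.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.data = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, float(done)))

    def sample(self, batch_size):
        batch = self.rng.sample(list(self.data), batch_size)
        # Transpose the list of transitions into one tuple per field.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.data)


def soft_update(target, online, tau):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target, online)]


buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.add(s=t, a=0.1 * t, r=1.0, s_next=t + 1, done=(t == 4))

s, a, r, s_next, done = buf.sample(batch_size=2)
target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.005)
```

With a capacity of 3, only the last three transitions survive, which is why old experience from a much weaker policy eventually leaves the buffer.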

Hyperparameter	Typical range	Notes
τ (Polyak)	0.001 – 0.01	Smaller τ gives more stable but slower-moving targets
lr_actor / lr_critic	1e-4 – 3e-4	Critic rate is usually equal to or higher than the actor's
buffer size	1e5 – 1e6	Continuous tasks need diversity
batch size	64 – 256
γ	0.99	Match task horizon
OU noise σ	0.1 – 0.3	Tune per env
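One way to keep these settings together is a small config object. The sketch below is hypothetical: the class name `DDPGConfig` and the specific defaults are picked from inside the ranges in the table, not recommended values. The `effective_horizon` helper uses the standard 1 / (1 − γ) rule of thumb as one way to read "match task horizon".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DDPGConfig:
    # Illustrative defaults taken from within the ranges in the table above.
    tau: float = 0.005
    lr_actor: float = 1e-4
    lr_critic: float = 3e-4
    buffer_size: int = 1_000_000
    batch_size: int = 128
    gamma: float = 0.99
    ou_sigma: float = 0.2

    def effective_horizon(self):
        # Rough number of future steps that still matter under discounting.
        return 1.0 / (1.0 - self.gamma)


cfg = DDPGConfig()
short_task = DDPGConfig(gamma=0.95)  # shorter horizon for a short-episode task
```

Here γ=0.99 corresponds to a horizon of roughly 100 steps and γ=0.95 to roughly 20, which gives a quick sanity check against the episode length of the task.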

TD3 improvements (know the names)

Issue in DDPG	TD3 fix
Q overestimation	Twin critics, take min(Q1, Q2)
High-variance targets	Target policy smoothing — noise on a′
Moving target	Delayed policy updates — actor every 2 critic steps
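The three fixes in the table can be sketched for a single scalar-action transition as follows. This is a simplified illustration, not a full TD3 implementation: `q1_target` and `q2_target` are hypothetical callables standing in for the two target critics at a fixed next state, and the noise scale 0.2, noise clip 0.5, and policy delay 2 are commonly used defaults assumed here.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def td3_target(r, done, a_next, q1_target, q2_target, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, a_low=-1.0, a_high=1.0, rng=random):
    # Target policy smoothing: clipped Gaussian noise on the target action.
    eps = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    a_smoothed = clip(a_next + eps, a_low, a_high)
    # Twin critics: bootstrap from the smaller of the two target estimates.
    q_min = min(q1_target(a_smoothed), q2_target(a_smoothed))
    return r + gamma * (1.0 - done) * q_min


def should_update_actor(step, policy_delay=2):
    # Delayed policy updates: the actor (and targets) move every policy_delay critic steps.
    return step % policy_delay == 0


# Constant stand-in critics make the min visible: 8.0 is used, not 10.0.
y = td3_target(r=1.0, done=0.0, a_next=0.2,
               q1_target=lambda a: 10.0, q2_target=lambda a: 8.0,
               rng=random.Random(0))
```

With the constant critics above the target is 1 + 0.99 × 8 = 8.92, lower than the 10.9 a single optimistic critic would give.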

When debugging DDPG, try TD3 in Stable-Baselines3 (TD3 class) before heavy manual tuning.

Common mistakes

Mistake	Symptom	Fix
No action noise	Flat learning, brittle Q	OU / Gaussian noise
τ too large	Oscillating Q	Reduce τ (e.g. to 0.005)
Actor faster than critic	Nonsense gradients	Delay actor (TD3)
Forgot to clip actions	Env NaNs	tanh + scale
Batch norm in Q on replay	Non-stationary stats	Layer norm or no BN
Evaluating with noise	Suboptimal deploy	μ(s) only at test
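Two of the fixes in the table, "tanh + scale" and "μ(s) only at test", fit in one small sketch. The function names and the example bounds of [-2, 2] are assumptions for illustration; `raw` stands in for the actor's pre-activation output.

```python
import math


def scale_action(raw, low, high):
    # tanh squashes to (-1, 1); the affine map then lands inside [low, high].
    squashed = math.tanh(raw)
    return low + 0.5 * (squashed + 1.0) * (high - low)


def act(raw, low, high, noise=0.0):
    # Training: pass exploration noise. Evaluation: leave noise at 0 so the action is mu(s).
    a = scale_action(raw, low, high) + noise
    # Clip again because added noise can push the action past the bounds.
    return min(high, max(low, a))


train_action = act(raw=100.0, low=-2.0, high=2.0, noise=0.5)  # clipped back to the upper bound
eval_action = act(raw=0.0, low=-2.0, high=2.0)                # deterministic, no noise
```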

Closing

DDPG showed that deep off-policy control works with a deterministic actor and a replay buffer. In practice, most people now start with TD3 or SAC for stability, but the DDPG pattern (a critic trained by TD, an actor ascending Q) is the foundation of both. Next lesson: SAC adds an entropy bonus for more robust exploration, with automatic temperature tuning.
