
Module 8 — Continuous control & robotics

Continuous action spaces

Gaussian policies, tanh squashing, action bounds, and reparameterization.

~55 min read + exercises


Before we begin

So far most algorithms assumed finite discrete actions — left/right, fire/no-fire. Real robots, cars, and joints need continuous control: torques, velocities, steering angles in a bounded interval or vector space. This changes which algorithms apply, how exploration works, and how function approximators parameterize policies.

Continuous action space — A ⊆ ℝⁿ, e.g. each dimension in [-1, 1].
Discretization — bin each dimension; use DQN (curse of dimensionality).
Policy parameterization — Gaussian mean + std, or deterministic tanh-squashed output.


What you will learn

  • Represent continuous actions in Gymnasium (Box space).
  • Contrast discretization, policy gradients, and actor–critic for continuous control.
  • Parameterize stochastic policies (Gaussian) vs deterministic (tanh).
  • Understand exploration without ε-greedy.
  • Map problems to algorithms: PPO, DDPG, SAC (next lessons).

Gymnasium Box spaces

python
import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1")
print(env.action_space)       # Box(-2.0, 2.0, (1,), float32)
print(env.observation_space)  # Box of shape (3,): cos θ, sin θ, θ̇

action = env.action_space.sample()  # uniform within bounds

# For learned policies, map a tanh output in [-1, 1] to [low, high].
low, high = env.action_space.low, env.action_space.high
raw_output = np.zeros(env.action_space.shape, dtype=np.float32)  # placeholder for a policy network output
action_tanh = np.tanh(raw_output)
scaled = low + (action_tanh + 1) * 0.5 * (high - low)

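Observations often include velocities and trigonometric features (cos θ, sin θ) for angles, which avoids a discontinuity at wrap-around. Always clip actions to the space's bounds before calling env.step, since the handling of out-of-range actions varies between environments.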
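As a small illustration of the clipping advice, the sketch below clamps an unbounded Gaussian sample to the Pendulum-v1 torque range with np.clip. The helper name clip_to_bounds and the hard-coded bounds are assumptions for this example; in practice you would read the bounds from env.action_space.

```python
import numpy as np

def clip_to_bounds(action, low, high):
    """Clamp each action dimension to [low, high] (hypothetical helper)."""
    return np.clip(np.asarray(action, dtype=np.float32), low, high)

# Bounds hard-coded to match Pendulum-v1's Box(-2.0, 2.0, (1,), float32).
low = np.array([-2.0], dtype=np.float32)
high = np.array([2.0], dtype=np.float32)

rng = np.random.default_rng(0)
raw = rng.normal(loc=0.0, scale=3.0, size=(1,))  # unbounded sample, may exceed bounds
safe = clip_to_bounds(raw, low, high)            # always inside [low, high]
```

Clipping changes the action that is actually executed, so the log-probability of the unclipped sample no longer describes it exactly. Squashing with tanh is one way to avoid that mismatch.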


Why discretization breaks

Discretize each of n dimensions into k bins → kⁿ actions. A 7-DOF robot arm with 10 bins per joint = 10⁷ = 10 million actions — infeasible for Q-learning.

| n dims | k = 5 bins | k = 11 bins |
| --- | --- | --- |
| 1 | 5 | 11 |
| 3 | 125 | 1,331 |
| 6 | 15,625 | 1.77M |

Coarse discretization also loses fine control: motion becomes jerky and the system can become unstable near the goal.
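The kⁿ growth is easy to check directly. The helper below (num_joint_actions is a name made up for this sketch) recomputes the counts from the table and the 7-DOF arm example.

```python
def num_joint_actions(n_dims: int, bins_per_dim: int) -> int:
    """Number of joint actions when each of n_dims is split into bins_per_dim bins."""
    return bins_per_dim ** n_dims

# Rows of the table: n = 1, 3, 6 with k = 5 and k = 11 bins.
for n in (1, 3, 6):
    print(n, num_joint_actions(n, 5), num_joint_actions(n, 11))

# 7-DOF arm with 10 bins per joint.
arm_actions = num_joint_actions(7, 10)
print(arm_actions)
```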

Checkpoint: Can PPO from Module 6 handle continuous actions directly?

Answer

Yes. PPO outputs a Gaussian policy (mean per dimension + learnable std) or squashed Gaussian. It is a standard baseline for continuous control when you already have on-policy infrastructure. DDPG and SAC add off-policy sample efficiency for many continuous domains.


Policy parameterizations

Stochastic Gaussian

π(a|s) = Normal(μ_θ(s), σ_θ(s)). Sample a during training for exploration; use the mean μ_θ(s) at evaluation. The policy-gradient loss needs the log-probability log π(a|s) of the sampled action.
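A minimal NumPy sketch of sampling from a diagonal Gaussian policy and computing its log-probability follows. The mean and std arrays are stand-ins for the outputs of a policy network, and gaussian_log_prob is a helper name chosen for this example; a real implementation would typically use an autodiff library's distribution classes instead.

```python
import numpy as np

def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    action, mean, std = (np.asarray(x, dtype=np.float64) for x in (action, mean, std))
    z = (action - mean) / std
    return np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi), axis=-1)

rng = np.random.default_rng(0)
mean = np.array([0.3, -0.1])  # stand-in for mu_theta(s)
std = np.array([0.5, 0.2])    # stand-in for sigma_theta(s)

train_action = mean + std * rng.standard_normal(mean.shape)  # training: sample to explore
eval_action = mean                                           # evaluation: use the mean
log_prob = gaussian_log_prob(train_action, mean, std)        # enters the policy-gradient loss
```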

Squashed Gaussian (SAC, PPO)

Sample u ~ Normal(μ_θ(s), σ_θ(s)), then set a = tanh(u) and scale to the bounds. The log-probability needs a Jacobian correction for the tanh: log π(a|s) = log Normal(u) − Σᵢ log(1 − tanh²(uᵢ)).
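The Jacobian correction can be sketched in NumPy as below, for a single action vector. The function names are made up for this example, and the extra term for the affine map to [low, high] is included on the assumption that you want the density of the final scaled action; some implementations omit it because it is constant with respect to the policy parameters.

```python
import numpy as np

def squashed_log_prob(u, mean, std, low, high):
    """Log-density of a = low + (tanh(u) + 1) / 2 * (high - low) for one action vector."""
    u, mean, std, low, high = (
        np.asarray(x, dtype=np.float64) for x in (u, mean, std, low, high)
    )
    # Diagonal Gaussian log-density of the pre-squash sample u.
    log_prob_u = np.sum(-0.5 * ((u - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi))
    # Numerically stable form of log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)).
    log_det_tanh = np.sum(2.0 * (np.log(2.0) - u - np.logaddexp(0.0, -2.0 * u)))
    # The affine map to [low, high] scales each dimension by (high - low) / 2.
    half_range = np.broadcast_to(0.5 * (high - low), u.shape)
    log_det_scale = np.sum(np.log(half_range))
    return log_prob_u - log_det_tanh - log_det_scale

def sample_squashed(mean, std, low, high, rng):
    """Draw one squashed-Gaussian action and return it with its log-probability."""
    mean, std, low, high = (np.asarray(x, dtype=np.float64) for x in (mean, std, low, high))
    u = mean + std * rng.standard_normal(mean.shape)
    action = low + (np.tanh(u) + 1.0) * 0.5 * (high - low)
    return action, squashed_log_prob(u, mean, std, low, high)

rng = np.random.default_rng(0)
action, log_prob = sample_squashed([0.0], [1.0], [-2.0], [2.0], rng)
```

Computing the log-probability from u rather than from the squashed action avoids inverting tanh, which is unstable when the action sits near a bound.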

Deterministic

a = tanh(μ_θ(s)) — no built-in exploration; add noise (Ornstein–Uhlenbeck or Gaussian) during training (DDPG).
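One way to add exploration noise to a deterministic action is an Ornstein-Uhlenbeck process, sketched below. The class, the helper noisy_action, and the parameter values (theta, sigma, dt) are illustrative assumptions rather than recommended settings, and mu stands in for the actor output.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, size, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.size, self.theta, self.sigma, self.dt = size, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at zero, e.g. at the start of each episode."""
        self.state = np.zeros(self.size)

    def sample(self):
        # Euler-Maruyama step: pull toward 0, plus scaled Gaussian noise.
        drift = -self.theta * self.state * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.size)
        self.state = self.state + drift + diffusion
        return self.state.copy()

def noisy_action(mu, noise, low=-1.0, high=1.0):
    """Deterministic tanh action plus exploration noise, clipped to the bounds."""
    return np.clip(np.tanh(mu) + noise.sample(), low, high)

noise = OUNoise(size=1)
mu = np.array([0.4])                    # stand-in for the actor output mu_theta(s)
train_action = noisy_action(mu, noise)  # training: explore
eval_action = np.tanh(mu)               # evaluation: no noise
```

Independent Gaussian noise is a simpler alternative: replace noise.sample() with a fresh normal draw at each step.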

| Style | Exploration | Off-policy friendly? |
| --- | --- | --- |
| Gaussian PG | Built-in σ | No (REINFORCE, on-policy) |
| Squashed Gaussian | σ + entropy bonus | Yes (SAC) |
| Deterministic + noise | External noise | Yes (DDPG, TD3) |

Value functions in continuous actions

Q(s, a) is defined for continuous a — but argmaxₐ Q(s, a) has no closed form for neural Q. Options:

  1. Cross-entropy method (CEM): search over a at every step (expensive).
  2. Deterministic policy gradient — actor outputs â directly; critic Q(s, â).
  3. Stochastic actor — maximize E_a~π[Q(s,a)] via reparameterization.

Hence actor–critic dominates continuous control; pure DQN needs discretization or CEM planning.
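To make option 3 in the list above concrete, here is a toy sketch of the reparameterization trick. It assumes a one-dimensional action and a hypothetical quadratic critic Q(a) = −(a − 1)² with a hand-written derivative; a real critic would be a neural network differentiated by an autodiff library. Writing a = μ + σ·ε with ε ~ Normal(0, 1) moves the randomness out of the parameters, so the gradient of E[Q] can be estimated by averaging the chain rule over samples.

```python
import numpy as np

def q_value(a):
    """Hypothetical critic with its maximum at a = 1."""
    return -(a - 1.0) ** 2

def dq_da(a):
    """Derivative of the toy critic with respect to the action."""
    return -2.0 * (a - 1.0)

rng = np.random.default_rng(0)
mu, sigma = 0.2, 0.5
eps = rng.standard_normal(100_000)

a = mu + sigma * eps                  # reparameterized samples from Normal(mu, sigma)
grad_mu = np.mean(dq_da(a) * 1.0)     # chain rule: da/dmu = 1
grad_sigma = np.mean(dq_da(a) * eps)  # chain rule: da/dsigma = eps

# Analytic values for this toy critic: E[Q] = -((mu - 1)^2 + sigma^2),
# so dE[Q]/dmu = -2 (mu - 1) and dE[Q]/dsigma = -2 sigma.
true_grad_mu = -2.0 * (mu - 1.0)
true_grad_sigma = -2.0 * sigma
```

For these values the gradient pushes μ toward 1 and σ toward 0, which is why stochastic-actor methods usually pair this objective with an entropy term to keep exploration alive.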


Worked example: Pendulum torque

State: angle and angular velocity. Action: torque ∈ [-2, 2]. Reward: −(θ² + 0.1θ̇² + 0.001a²) — upright is best, small torques preferred.
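The reward formula above can be written as a small function. This is a sketch under two assumptions that the formula itself does not state: the angle is wrapped to [−π, π) so that θ = 0 means upright, and the torque is clipped to the action bounds before the cost is computed. The function names are chosen for this example.

```python
import numpy as np

def angle_normalize(theta):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return ((theta + np.pi) % (2.0 * np.pi)) - np.pi

def pendulum_reward(theta, theta_dot, torque, max_torque=2.0):
    """Reward -(theta^2 + 0.1 * theta_dot^2 + 0.001 * a^2), with theta = 0 upright."""
    a = float(np.clip(torque, -max_torque, max_torque))
    return -(angle_normalize(theta) ** 2 + 0.1 * theta_dot**2 + 0.001 * a**2)

upright = pendulum_reward(0.0, 0.0, 0.0)      # best case: no angle, velocity, or torque cost
hanging = pendulum_reward(np.pi, 0.0, 0.0)    # hanging straight down: angle cost is pi^2
spinning = pendulum_reward(0.0, 1.0, 2.0)     # upright but moving, with full torque
```

The weights show the priorities: angle error dominates, velocity matters less, and the torque penalty is small enough that it mainly discourages needless effort.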

Random policy average return ≈ −1200. A tuned SAC often reaches −150 within 50k steps. Continuous torque allows smooth balance; discretizing torque to −2, 0, or 2 makes balancing much harder.


Algorithm selection guide

| Algorithm | On/off policy | Deterministic? | Typical env |
| --- | --- | --- | --- |
| PPO | On | Stochastic | General continuous |
| DDPG | Off | Yes (+ noise) | MuJoCo benchmarks |
| TD3 | Off | Yes (twin Q) | Same, more stable |
| SAC | Off | Stochastic | Sample-efficient control |

Common mistakes

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Unscaled tanh output | Saturated actions | Affine map to low/high |
| Zero exploration (deterministic) | Stuck in local policy | OU noise or stochastic policy |
| Wrong log-prob (no tanh correction) | Biased gradients | Use library loss or Jacobian |
| Discretizing unnecessarily | Jerky, slow learning | PPO/SAC on Box |
| Ignoring action cost in reward | Oscillating torques | Penalty on ‖a‖² in reward design |

Closing

Continuous control is the default for robotics and physics simulation. You represent actions as vectors, parameterize policies that output real-valued commands, and pick actor–critic methods that avoid brute-force maximization over Q. Next: DDPG for deterministic off-policy control, then SAC for entropy-regularized stochastic control.

