Continuous action spaces
Before we begin
So far, most algorithms have assumed a finite set of discrete actions, such as left/right or fire/no-fire. Real robots, cars, and joints need continuous control: torques, velocities, and steering angles that live in a bounded interval or vector space. This changes which algorithms apply, how exploration works, and how function approximators parameterize policies.
Continuous action space — A ⊆ ℝⁿ, e.g. each dimension in [-1, 1].
Discretization — bin each dimension; use DQN (curse of dimensionality).
Policy parameterization — Gaussian mean + std, or deterministic tanh-squashed output.
What you will learn
- Represent continuous actions in Gymnasium (Box space).
- Contrast discretization, policy gradients, and actor–critic for continuous control.
- Parameterize stochastic policies (Gaussian) vs deterministic (tanh).
- Understand exploration without ε-greedy.
- Map problems to algorithms: PPO, DDPG, SAC (next lessons).
Gymnasium Box spaces
```python
import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1")
print(env.action_space)       # Box(-2.0, 2.0, (1,), float32)
print(env.observation_space)  # Box with shape (3,): cos θ, sin θ, θ̇

action = env.action_space.sample()  # uniform in bounds

# For learned policies, scale a tanh output in [-1, 1] to [low, high].
low, high = env.action_space.low, env.action_space.high
action_tanh = np.tanh(np.zeros(1, dtype=np.float32))  # placeholder for a policy's squashed output
scaled = low + (action_tanh + 1) * 0.5 * (high - low)
```

Observations often include velocities and trigonometric features for angles (to handle wrap-around). Clip actions to the bounds before calling env.step, since out-of-range actions are handled differently from one environment to the next.
Why discretization breaks
Discretize each of n dimensions into k bins → kⁿ actions. A 7-DOF robot arm with 10 bins per joint = 10⁷ = 10 million actions — infeasible for Q-learning.
| n dims | k=5 bins | k=11 bins |
|---|---|---|
| 1 | 5 | 11 |
| 3 | 125 | 1,331 |
| 6 | 15,625 | 1.77M |
Coarse discretization also loses fine control: motion becomes jerky and the policy can become unstable near the goal.
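The growth in the table can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `joint_action_count` and `discretize` are illustrative and do not come from any RL library.

```python
from itertools import product

def joint_action_count(n_dims: int, bins_per_dim: int) -> int:
    """Number of joint actions when every dimension gets the same number of bins."""
    return bins_per_dim ** n_dims

def discretize(low: float, high: float, bins: int) -> list[float]:
    """Evenly spaced action values from low to high, endpoints included."""
    if bins < 2:
        raise ValueError("need at least two bins to span an interval")
    step = (high - low) / (bins - 1)
    return [low + i * step for i in range(bins)]

# Recompute the table rows and the 7-DOF example from the text.
for n in (1, 3, 6):
    print(n, joint_action_count(n, 5), joint_action_count(n, 11))
print(joint_action_count(7, 10))

# The joint action set is the Cartesian product of the per-dimension grids.
grid_1d = discretize(-1.0, 1.0, 5)
joint_actions = list(product(grid_1d, repeat=3))
print(len(joint_actions))
```

A DQN head would need one output per element of `joint_actions`, which is why the approach stops scaling after a handful of dimensions.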
Checkpoint: Can PPO from Module 6 handle continuous actions directly?
Answer
Yes. PPO outputs a Gaussian policy (mean per dimension + learnable std) or squashed Gaussian. It is a standard baseline for continuous control when you already have on-policy infrastructure. DDPG and SAC add off-policy sample efficiency for many continuous domains.
Policy parameterizations
Stochastic Gaussian
π(a|s) = Normal(μ_θ(s), σ_θ(s)). Sample a for exploration during training; use the mean μ at evaluation. The policy gradient needs the log-probability log π(a|s).
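As a rough illustration of sampling and log-probability for a diagonal Gaussian policy, the sketch below uses plain Python lists. The `mean` and `std` values are hard-coded stand-ins for network outputs, and the function names are hypothetical.

```python
import math
import random

def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    total = 0.0
    for a, m, s in zip(action, mean, std):
        z = (a - m) / s
        total += -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return total

def sample_gaussian(mean, std, rng):
    """Draw one action; each dimension is sampled independently."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

rng = random.Random(0)
mean, std = [0.3, -0.1], [0.5, 0.2]  # stand-ins for μ_θ(s) and σ_θ(s)

train_action = sample_gaussian(mean, std, rng)  # stochastic action for exploration
eval_action = list(mean)                        # deterministic action at evaluation
print(gaussian_log_prob(train_action, mean, std))
```

In a real agent the same two operations would usually come from a library distribution object, with the log-probability feeding the policy-gradient loss.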
Squashed Gaussian (SAC, PPO)
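Sample u ~ Normal(μ_θ(s), σ_θ(s)), then set a = tanh(u) and scale it to the bounds. Because tanh changes the density, the log-probability needs a Jacobian correction: log π(a|s) = log Normal(u; μ, σ) − Σᵢ log(1 − tanh²(uᵢ)), plus a constant term for the affine scaling.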
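The tanh correction is easy to get wrong numerically, so here is a small sketch of one way to compute it. It assumes the affine map a = low + (tanh(u) + 1) / 2 · (high − low) and uses the identity log(1 − tanh²(u)) = 2(log 2 − u − softplus(−2u)) to avoid overflow. All function names are illustrative.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_one_minus_tanh_sq(u):
    """log(1 - tanh(u)^2), computed without overflow for large |u|."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))

def squashed_sample_and_log_prob(mean, std, low, high, rng):
    """Sample a = low + (tanh(u) + 1) / 2 * (high - low) and return (a, log pi(a))."""
    action, log_prob = [], 0.0
    for m, s, lo, hi in zip(mean, std, low, high):
        u = rng.gauss(m, s)
        z = (u - m) / s
        # Gaussian log-density of the pre-squash sample u.
        log_prob += -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        # Change of variables: subtract log|da/du| = log((hi - lo) / 2) + log(1 - tanh(u)^2).
        log_prob -= math.log((hi - lo) / 2.0) + log_one_minus_tanh_sq(u)
        action.append(lo + (math.tanh(u) + 1.0) * 0.5 * (hi - lo))
    return action, log_prob

rng = random.Random(0)
action, log_prob = squashed_sample_and_log_prob([0.0], [1.0], [-2.0], [2.0], rng)
print(action, log_prob)
```

The scaling term is constant in the policy parameters, so some implementations drop it; it is kept here so the returned value is the actual log-density of the scaled action.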
Deterministic
a = tanh(μ_θ(s)) — no built-in exploration; add noise (Ornstein–Uhlenbeck or Gaussian) during training (DDPG).
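For the deterministic case, exploration noise has to be added by hand. The sketch below shows one possible Ornstein–Uhlenbeck process in plain Python; the default values for `theta`, `sigma`, and `dt` are illustrative assumptions rather than recommended settings, and the class and function names are hypothetical.

```python
import math
import random

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.size, self.mu, self.theta, self.sigma, self.dt = size, mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, typically at the start of each episode."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process one step and return a copy of the new state."""
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)

def noisy_action(policy_output, noise, low, high):
    """Add exploration noise to a deterministic action, then clip to the bounds."""
    return [min(max(a + n, lo), hi) for a, n, lo, hi in zip(policy_output, noise, low, high)]

noise = OrnsteinUhlenbeckNoise(size=1)
deterministic = [math.tanh(0.8) * 2.0]  # tanh output scaled to Pendulum's [-2, 2]
for _ in range(3):
    print(noisy_action(deterministic, noise.sample(), [-2.0], [2.0]))
```

Independent Gaussian noise per step is a simpler alternative: replace `noise.sample()` with fresh draws from `random.gauss`.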
| Style | Exploration | Off-policy friendly? |
|---|---|---|
| Gaussian PG | Built-in σ | No (on-policy: REINFORCE, PPO) |
| Squashed Gaussian | σ + entropy bonus | Yes (SAC) |
| Deterministic + noise | External noise | Yes (DDPG, TD3) |
Value functions in continuous actions
Q(s, a) is defined for continuous a — but argmaxₐ Q(s, a) has no closed form for neural Q. Options:
- Cross-entropy method (CEM) optimization over a at every step (expensive).
- Deterministic policy gradient — actor outputs â directly; critic Q(s, â).
- Stochastic actor — maximize E_a~π[Q(s,a)] via reparameterization.
Hence actor–critic dominates continuous control; pure DQN needs discretization or CEM planning.
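To make the cost of the first option concrete, here is a minimal sketch of the cross-entropy method searching for the best scalar action under a critic. The critic `toy_q` is a made-up quadratic with a known maximizer, and the population and elite sizes are arbitrary illustrative choices.

```python
import random
import statistics

def cem_argmax(q_fn, state, low, high, iters=10, pop=64, n_elite=8, seed=0):
    """Approximate argmax_a q_fn(state, a) over a scalar action in [low, high]."""
    rng = random.Random(seed)
    mean, std = 0.5 * (low + high), 0.5 * (high - low)
    for _ in range(iters):
        # Sample candidate actions and clip them to the bounds.
        candidates = [min(max(rng.gauss(mean, std), low), high) for _ in range(pop)]
        # Keep the highest-scoring candidates and refit the sampling distribution.
        elites = sorted(candidates, key=lambda a: q_fn(state, a), reverse=True)[:n_elite]
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-6  # small floor keeps sampling well-defined
    return mean

def toy_q(state, action):
    """Hypothetical critic whose maximizer is action = 0.5 * state."""
    return -(action - 0.5 * state) ** 2

best = cem_argmax(toy_q, state=1.0, low=-2.0, high=2.0)
print(best)  # each call costs iters * pop critic evaluations
```

With these settings one action selection takes 640 critic evaluations, compared with a single forward pass for an actor network. That gap is the practical argument for actor–critic methods.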
Worked example: Pendulum torque
State: angle and angular velocity. Action: torque ∈ [-2, 2]. Reward: −(θ² + 0.1θ̇² + 0.001a²) — upright is best, small torques preferred.
A random policy averages a return of roughly −1200. A tuned SAC agent often reaches about −150 within 50k steps. Continuous torque allows smooth balance; discretizing torque to −2, 0, or 2 makes balancing much harder.
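The reward formula above can be written out directly. This sketch assumes the angle is wrapped so that upright corresponds to θ = 0 and that torque is clipped to the action bounds before the cost is computed; both are assumptions about the environment's internals, not something stated in the text, and the function names are illustrative.

```python
import math

def angle_normalize(theta):
    """Wrap an angle to [-pi, pi) so that upright is theta = 0."""
    return ((theta + math.pi) % (2.0 * math.pi)) - math.pi

def pendulum_reward(theta, theta_dot, torque, max_torque=2.0):
    """Per-step reward -(theta^2 + 0.1 * theta_dot^2 + 0.001 * torque^2)."""
    torque = min(max(torque, -max_torque), max_torque)  # assumed clipping to bounds
    theta = angle_normalize(theta)
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)

print(pendulum_reward(0.0, 0.0, 0.0))      # upright, still, no torque: the best case
print(pendulum_reward(math.pi, 0.0, 2.0))  # hanging down with full torque
```

The angle term dominates: hanging straight down costs about π² per step, while full torque adds only 0.004, so the action penalty mainly discourages chattering once the pendulum is near upright.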
Algorithm selection guide
| Algorithm | On/off policy | Deterministic? | Typical env |
|---|---|---|---|
| PPO | On | Stochastic | General continuous |
| DDPG | Off | Yes (+ noise) | MuJoCo benchmarks |
| TD3 | Off | Yes (twin Q) | Same, more stable |
| SAC | Off | Stochastic | Sample-efficient control |
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Unscaled tanh output | Actions cover the wrong range (e.g. [-1, 1] instead of [-2, 2]) | Affine map to low/high |
| Zero exploration (deterministic) | Stuck in a poor local optimum | OU noise or stochastic policy |
| Wrong log-prob (no tanh correction) | Biased gradients | Use library loss or Jacobian |
| Discretizing unnecessarily | Jerky, slow learning | PPO/SAC on Box |
| Ignoring action cost in reward | Oscillating torques | Penalty on ‖a‖² in reward design |
Closing
Continuous control is the default for robotics and physics simulation. You represent actions as vectors, parameterize policies that output real-valued commands, and pick actor–critic methods that avoid brute-force maximization over Q. Next: DDPG for deterministic off-policy control, then SAC for entropy-regularized stochastic control.