Continuous action spaces

Before we begin

So far most algorithms assumed finite discrete actions — left/right, fire/no-fire. Real robots, cars, and joints need continuous control: torques, velocities, steering angles in a bounded interval or vector space. This changes which algorithms apply, how exploration works, and how function approximators parameterize policies.

Continuous action space — A ⊆ ℝⁿ, e.g. each dimension in [-1, 1].
Discretization — bin each dimension; use DQN (curse of dimensionality).
Policy parameterization — Gaussian mean + std, or deterministic tanh-squashed output.

What you will learn

Represent continuous actions in Gymnasium (Box space).
Contrast discretization, policy gradients, and actor–critic for continuous control.
Parameterize stochastic policies (Gaussian) vs deterministic (tanh).
Understand exploration without ε-greedy.
Map problems to algorithms: PPO, DDPG, SAC (next lessons).

Gymnasium Box spaces

python

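import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1")
print(env.action_space)       # Box(-2.0, 2.0, (1,), float32)
print(env.observation_space)  # 3-dim Box; observation is (cos θ, sin θ, θ̇)

action = env.action_space.sample()  # uniform random action within bounds

# For learned policies, map a tanh output in [-1, 1] to [low, high].
# raw_output is a placeholder for a policy network's pre-activation output.
raw_output = np.zeros(env.action_space.shape, dtype=np.float32)
action_tanh = np.tanh(raw_output)
low, high = env.action_space.low, env.action_space.high
scaled = low + (action_tanh + 1) * 0.5 * (high - low)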

Observations often include velocities and trigonometric features (cos θ, sin θ) for angles, which avoids the discontinuity at wrap-around. Always clip actions to bounds before env.step to avoid undefined behavior.
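As a minimal sketch of the clipping step, assuming the action is a plain Python list (the helper name clip_action is illustrative and not part of Gymnasium):

```python
def clip_action(action, low, high):
    """Clamp each action dimension into [low_i, high_i]."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


# Pendulum-v1 torque bounds, as printed by env.action_space above.
low, high = [-2.0], [2.0]
print(clip_action([3.7], low, high))   # [2.0]
print(clip_action([-0.4], low, high))  # [-0.4]
```

With NumPy arrays, np.clip(action, env.action_space.low, env.action_space.high) does the same job in one call.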

Why discretization breaks

Discretize each of n dimensions into k bins → kⁿ actions. A 7-DOF robot arm with 10 bins per joint = 10⁷ = 10 million actions — infeasible for Q-learning.

n dims	k=5 bins	k=11 bins
1	5	11
3	125	1,331
6	15,625	1.77M

Coarse discretization also loses fine control: motion becomes jerky and the system can be unstable near the goal.
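The growth in the table can be reproduced with a few lines. This is only an illustrative sketch; joint_action_count and discretize are hypothetical helper names, and the evenly spaced bins are one possible choice.

```python
from itertools import product


def joint_action_count(n_dims, bins_per_dim):
    """Number of joint actions when each of n_dims is split into bins_per_dim bins."""
    return bins_per_dim ** n_dims


def discretize(low, high, bins):
    """Evenly spaced bin values in [low, high], endpoints included (bins >= 2)."""
    step = (high - low) / (bins - 1)
    return [low + i * step for i in range(bins)]


# Rows of the table, plus a 7-dimensional case.
for n in (1, 3, 6, 7):
    print(n, joint_action_count(n, 5), joint_action_count(n, 11))

# A small joint grid: 2 dimensions, 3 torque levels each gives 9 joint actions.
levels = discretize(-2.0, 2.0, 3)
grid = list(product(levels, repeat=2))
print(levels, len(grid))
```

A DQN-style network would need one output unit per entry of such a grid, which is why the approach stops scaling after a handful of dimensions.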

Checkpoint: Can PPO from Module 6 handle continuous actions directly?

Answer

Yes. PPO outputs a Gaussian policy (mean per dimension + learnable std) or squashed Gaussian. It is a standard baseline for continuous control when you already have on-policy infrastructure. DDPG and SAC add off-policy sample efficiency for many continuous domains.

Policy parameterizations

Stochastic Gaussian

π(a|s) = Normal(μ_θ(s), σ_θ(s)). Sample a for exploration; use μ at eval. Log-probability needed for policy gradient.
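A minimal sketch of sampling and log-probability for a diagonal Gaussian policy, in plain Python. The mean and std lists stand in for the outputs μ_θ(s) and σ_θ(s) of a policy network; their values and the function names are assumptions for illustration.

```python
import math
import random


def sample_gaussian(mean, std, rng):
    """Draw one action from a diagonal Gaussian, one value per dimension."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian: sum of per-dimension log-densities."""
    total = 0.0
    for a, m, s in zip(action, mean, std):
        z = (a - m) / s
        total += -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return total


rng = random.Random(0)
mean, std = [0.3, -0.1], [0.5, 0.2]        # placeholders for mu(s), sigma(s)
a_train = sample_gaussian(mean, std, rng)  # exploratory action during training
a_eval = mean                              # deterministic action at evaluation
print(a_train, gaussian_log_prob(a_train, mean, std))
```

In a policy-gradient update this log-probability is the quantity that gets differentiated with respect to the network parameters; an autodiff library would normally handle that part.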

Squashed Gaussian (SAC, PPO)

Sample u ~ Normal, then a = tanh(u) scaled to bounds. Jacobian correction in log-prob for tanh.
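The Jacobian correction follows from the change of variables a = low + (tanh(u) + 1)/2 · (high − low): the log-density of a is the Gaussian log-density of u minus log |da/du| per dimension. The sketch below assumes that affine scaling and adds a small constant (1e-6, an arbitrary choice) inside the logarithm for numerical safety; function names are illustrative.

```python
import math


def squash(u, low, high):
    """Map an unbounded sample u to bounded actions via tanh and affine scaling."""
    return [lo + (math.tanh(u_i) + 1.0) * 0.5 * (hi - lo)
            for u_i, lo, hi in zip(u, low, high)]


def squashed_log_prob(u, mean, std, low, high):
    """Log-density of squash(u) when u ~ Normal(mean, std), per the change of variables."""
    log_p = 0.0
    for u_i, m, s, lo, hi in zip(u, mean, std, low, high):
        z = (u_i - m) / s
        log_p += -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        # Subtract log |da/du| = log(scale) + log(1 - tanh(u)^2).
        scale = 0.5 * (hi - lo)
        log_p -= math.log(scale) + math.log(1.0 - math.tanh(u_i) ** 2 + 1e-6)
    return log_p


u = [0.4]
print(squash(u, [-2.0], [2.0]))
print(squashed_log_prob(u, [0.0], [1.0], [-2.0], [2.0]))
```

Omitting the correction term leaves the plain Gaussian log-density of u, which is not the density of the action actually sent to the environment.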

Deterministic

a = tanh(μ_θ(s)) — no built-in exploration; add noise (Ornstein–Uhlenbeck or Gaussian) during training (DDPG).
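A rough sketch of exploration for a deterministic policy: the tanh output is scaled to the bounds, noise is added, and the result is clipped. The OUNoise class is a simple discretized Ornstein–Uhlenbeck process; its parameter values (theta, sigma, dt) are illustrative assumptions, not settings taken from this lesson.

```python
import math
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that reverts to zero."""

    def __init__(self, size, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = random.Random(seed)
        self.state = [0.0] * size

    def sample(self):
        # Euler step: pull toward zero, then add scaled Gaussian noise.
        self.state = [
            x - self.theta * x * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return self.state


def noisy_action(pre_activation, noise, low, high):
    """Deterministic tanh action scaled to bounds, plus noise, clipped back into bounds."""
    out = []
    for p, n, lo, hi in zip(pre_activation, noise, low, high):
        a = lo + (math.tanh(p) + 1.0) * 0.5 * (hi - lo)
        out.append(min(max(a + n, lo), hi))
    return out


ou = OUNoise(size=1)
for _ in range(3):
    print(noisy_action([0.8], ou.sample(), [-2.0], [2.0]))
```

Replacing ou.sample() with independent Gaussian draws gives the uncorrelated variant; at evaluation time the noise is simply dropped.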

Style	Exploration	Typical algorithms
Gaussian PG	Built-in σ	REINFORCE, PPO (on-policy)
Squashed Gaussian	σ + entropy bonus	SAC (off-policy)
Deterministic + noise	External noise	DDPG, TD3 (off-policy)

Value functions in continuous actions

Q(s, a) is defined for continuous a — but argmaxₐ Q(s, a) has no closed form for neural Q. Options:

Cross-entropy method (CEM) search over a at every step (expensive).
Deterministic policy gradient — actor outputs â directly; critic Q(s, â).
Stochastic actor — maximize E_a~π[Q(s,a)] via reparameterization.

Hence actor–critic dominates continuous control; pure DQN needs discretization or CEM planning.
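To make the cost of the first option concrete, here is a small sketch of CEM searching for argmaxₐ Q(s, a) over a scalar action. The function toy_q is a stand-in for a critic at a fixed state, not a trained network, and the population size, elite count, and iteration count are arbitrary assumptions.

```python
import random
import statistics


def cem_argmax(q_fn, low, high, iters=10, pop=64, n_elite=8, seed=0):
    """Approximate argmax_a q_fn(a) over a scalar action in [low, high] with CEM."""
    rng = random.Random(seed)
    mean, std = 0.5 * (low + high), 0.5 * (high - low)
    for _ in range(iters):
        # Sample candidate actions and keep them inside the bounds.
        samples = [min(max(rng.gauss(mean, std), low), high) for _ in range(pop)]
        # Refit the sampling distribution to the highest-scoring candidates.
        elites = sorted(samples, key=q_fn, reverse=True)[:n_elite]
        mean = statistics.fmean(elites)
        std = max(statistics.pstdev(elites), 1e-3)  # floor avoids a zero-width search
    return mean


def toy_q(a):
    """Toy critic for one fixed state; its maximum is at a = 0.5."""
    return -(a - 0.5) ** 2


best = cem_argmax(toy_q, -1.0, 1.0)
print(round(best, 2))
```

With these settings a single action choice costs iters × pop = 640 critic evaluations, whereas an actor network produces an action in one forward pass.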

Worked example: Pendulum torque

State: angle and angular velocity. Action: torque ∈ [-2, 2]. Reward: −(θ² + 0.1θ̇² + 0.001a²) — upright is best, small torques preferred.

Random policy average return ≈ −1200. A tuned SAC often reaches −150 within 50k steps. Continuous torque allows smooth balance; discretizing torque to −2, 0, or 2 makes balancing much harder.
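The reward formula in this example can be checked numerically. The sketch below assumes the angle is wrapped to [-π, π) before squaring, with θ = 0 meaning upright, and uses the Pendulum-v1 limits of 8 for angular speed and 2 for torque; the function names are illustrative.

```python
import math


def angle_normalize(theta):
    """Wrap an angle to [-pi, pi) so that upright is theta = 0."""
    return ((theta + math.pi) % (2.0 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    """Reward from the worked example: -(theta^2 + 0.1*theta_dot^2 + 0.001*torque^2)."""
    th = angle_normalize(theta)
    return -(th ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)


print(pendulum_reward(0.0, 0.0, 0.0))      # upright, at rest, no torque: best case
print(pendulum_reward(math.pi, 0.0, 0.0))  # hanging straight down, at rest
print(pendulum_reward(math.pi, 8.0, 2.0))  # worst case within the assumed limits
```

The torque term is three orders of magnitude smaller than the angle term, so the penalty mainly discourages unnecessary effort once the pendulum is already near upright.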

Algorithm selection guide

Algorithm	On/off policy	Policy type	Typical use
PPO	On	Stochastic	General continuous control
DDPG	Off	Deterministic (+ noise)	MuJoCo benchmarks
TD3	Off	Deterministic (+ noise), twin Q critics	MuJoCo benchmarks, more stable than DDPG
SAC	Off	Stochastic	Sample-efficient control

Common mistakes

Mistake	Symptom	Fix
Unscaled tanh output	Actions confined to [-1, 1], never reaching the full range	Affine map to low/high
Zero exploration (deterministic)	Stuck in a poor local optimum	OU noise or stochastic policy
Wrong log-prob (no tanh correction)	Biased gradients	Use library loss or Jacobian
Discretizing unnecessarily	Jerky, slow learning	PPO/SAC on Box
Ignoring action cost in reward	Oscillating torques	Penalty on ‖a‖² in reward design

Closing

Continuous control is the default for robotics and physics simulation. You represent actions as vectors, parameterize policies that output real-valued commands, and pick actor–critic methods that avoid brute-force maximization over Q. Next: DDPG for deterministic off-policy control, then SAC for entropy-regularized stochastic control.
