A2C & parallel RL
Before we begin
Advantage Actor–Critic (A2C) is synchronous parallel RL: multiple workers collect experience simultaneously, then a single optimizer updates shared weights. It is PPO's lighter cousin — no clip, often one epoch per batch — but the vectorized environment pattern is identical and essential for throughput.
Learning objectives
- Contrast A2C (synchronous) with A3C (asynchronous).
- Use Gymnasium VectorEnv or SB3 SubprocVecEnv.
- Aggregate rollouts from N workers into one GAE batch.
- Explain wall-clock speedup vs sample efficiency tradeoffs.
- Choose A2C vs PPO for a given project budget.
A2C update (no clip)
Same actor–critic + GAE as PPO, but typically:
L = − log π(a|s) · A_t + c_1 (V − R)² − c_2 H(π)
One pass over the data per iteration: simpler than PPO, but sometimes less stable on hard envs.
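To make the three terms concrete, here is a minimal NumPy sketch that evaluates the loss above as a scalar over a batch. The function name `a2c_loss` and the coefficient defaults `c1=0.5`, `c2=0.01` are assumptions chosen for illustration, not values from this lesson. In a real training loop the same expression would be written in an autograd framework, with the advantages treated as constants.

```python
import numpy as np

def a2c_loss(log_probs, advantages, values, returns, entropies, c1=0.5, c2=0.01):
    """Scalar A2C loss averaged over a batch of transitions.

    All inputs are 1-D arrays of equal length.
    c1 and c2 are illustrative defaults (assumed, not from the lesson).
    """
    policy_loss = -(log_probs * advantages).mean()  # -log pi(a|s) * A_t
    value_loss = ((values - returns) ** 2).mean()   # (V - R)^2
    entropy = entropies.mean()                      # H(pi)
    return policy_loss + c1 * value_loss - c2 * entropy

# Tiny hand-made batch of two transitions.
loss = a2c_loss(
    log_probs=np.array([-0.5, -1.0]),
    advantages=np.array([1.0, -1.0]),
    values=np.array([1.0, 2.0]),
    returns=np.array([2.0, 2.0]),
    entropies=np.array([0.6, 0.6]),
)
print(loss)
```

A larger `c2` rewards higher entropy and so pushes the policy toward more exploration; a larger `c1` weights the critic's regression error more heavily.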
Vectorized collection
import gymnasium as gym

n_envs = 8
rollout_steps = 256  # steps per env per rollout (illustrative value)
envs = gym.make_vec("CartPole-v1", num_envs=n_envs, vectorization_mode="sync")
obs, _ = envs.reset(seed=42)
# obs shape: (n_envs, obs_dim)

# `policy` and `buffer` are placeholders for your own actor and rollout storage.
for step in range(rollout_steps):
    actions = policy.act(obs)  # shape: (n_envs,)
    obs2, rewards, term, trunc, infos = envs.step(actions)
    dones = term | trunc  # episode ended by termination or time limit
    buffer.store(obs, actions, rewards, obs2, dones)
    obs = obs2

Parallel envs improve steps per second, not sample efficiency per step: the agent still needs the same total number of environment steps to learn.
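Once a rollout is stored as arrays of shape (rollout_steps, n_envs), the per-env trajectories can be turned into one GAE batch. The sketch below is one possible way to do that in NumPy; `gae_batch` and its argument names are hypothetical, and the `gamma` and `lam` defaults are illustrative. It assumes `dones[t]` marks an episode ending at step t and, as a simplification, stops bootstrapping on truncation as well as termination.

```python
import numpy as np

def gae_batch(rewards, values, dones, last_values, gamma=0.99, lam=0.95):
    """Compute GAE independently per env, then flatten to one batch.

    rewards, values, dones: arrays of shape (T, N) for T steps and N envs.
    last_values: shape (N,), critic estimate for the observation after the
    final step, used to bootstrap unfinished episodes.
    Returns (advantages, returns), each of shape (T * N,).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    T, N = rewards.shape

    advantages = np.zeros((T, N))
    next_adv = np.zeros(N)
    next_values = np.asarray(last_values, dtype=np.float64)

    # Walk backwards in time; every env column is processed in parallel.
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]  # 0 cuts the recursion at episode ends
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_values = values[t]

    returns = advantages + values  # regression targets for the critic
    return advantages.reshape(T * N), returns.reshape(T * N)

# Two steps from two envs; env 0 finishes an episode at step 0.
adv, ret = gae_batch(
    rewards=np.ones((2, 2)),
    values=np.zeros((2, 2)),
    dones=np.array([[1.0, 0.0], [0.0, 0.0]]),
    last_values=np.zeros(2),
    gamma=1.0,
    lam=1.0,
)
print(adv.shape, adv)
```

The flattening happens only after the backward pass, because the recursion must follow each env's own timeline; after that the optimizer can treat all T × N transitions as a single batch.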
Worked example — throughput
| Setup | Steps/sec (illustrative) | Time to 1M steps |
|---|---|---|
| 1 env | 2,000 | ~8.3 min |
| 8 envs | 12,000 | ~1.4 min |
| 32 envs (GPU policy) | 40,000+ | ~25 sec |
Returns diminish once the policy forward pass or the environment simulation becomes the bottleneck.
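The table's arithmetic can be reproduced in a few lines. This sketch uses the illustrative steps/sec figures from the table (taking 40,000 for the "40,000+" row); the helper names `seconds_to_reach` and `scaling_efficiency` are made up for this example. The efficiency figure compares measured throughput with perfect linear scaling from the single-env rate.

```python
def seconds_to_reach(total_steps, steps_per_sec):
    """Wall-clock seconds needed to collect total_steps transitions."""
    return total_steps / steps_per_sec

def scaling_efficiency(steps_per_sec, single_env_steps_per_sec, n_envs):
    """Fraction of ideal linear speedup achieved (1.0 means perfect scaling)."""
    return steps_per_sec / (single_env_steps_per_sec * n_envs)

# (label, n_envs, illustrative steps/sec) taken from the table above.
setups = [("1 env", 1, 2_000), ("8 envs", 8, 12_000), ("32 envs", 32, 40_000)]

for label, n_envs, sps in setups:
    secs = seconds_to_reach(1_000_000, sps)
    eff = scaling_efficiency(sps, 2_000, n_envs)
    print(f"{label}: {secs:.0f} s ({secs / 60:.1f} min), efficiency {eff:.2f}")
```

With these numbers, efficiency falls as envs are added, which is the diminishing-returns pattern in numeric form.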
A2C vs A3C vs PPO
| Algorithm | Parallelism | Stability | Typical use |
|---|---|---|---|
| A3C | Async workers + stale grads | Noisy but once popular | Legacy |
| A2C | Sync batch update | Moderate | Baseline parallel |
| PPO | Sync + clip + multi-epoch | Strong | Default |
Modern libraries favor synchronous updates — GPUs prefer batched tensors over lock-heavy async.
Shared network batch forward
import torch

def act_batch(policy, obs_batch):
    """Sample actions for all envs in one forward pass.

    obs_batch: (n_envs, obs_dim)
    """
    obs_t = torch.as_tensor(obs_batch, dtype=torch.float32)
    with torch.no_grad():  # no gradients needed while collecting data
        dist = policy(obs_t)  # assumes policy returns a torch distribution
        actions = dist.sample()
        log_probs = dist.log_prob(actions)
        values = policy.value_head(obs_t)
    return actions.numpy(), log_probs.numpy(), values.numpy()

Batching amortizes GPU kernel launch overhead across all n_envs observations.
Sample efficiency vs wall-clock
- More envs → faster wall-clock to N total steps.
- PPO multi-epoch → more gradient updates per env step (better sample use, risk overfit).
- A2C → 1 update per rollout (faster iterations, may need more env steps).
For the Lunar Lander project, PPO with 4–8 parallel envs is a sweet spot on a laptop CPU.
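The difference in gradient updates per environment step can be counted directly. The sketch below is a back-of-the-envelope calculation; `updates_per_run` is a hypothetical helper, and the rollout length of 256, the 10 epochs, and the minibatch size of 64 are illustrative settings rather than recommendations from this lesson.

```python
def updates_per_run(total_timesteps, n_envs, n_steps, n_epochs=1, minibatch_size=None):
    """Count gradient updates for a fixed environment-step budget.

    One rollout holds n_envs * n_steps transitions. With minibatch_size=None
    the whole rollout is used for a single update per epoch (A2C-style).
    """
    rollout_size = n_envs * n_steps
    n_rollouts = total_timesteps // rollout_size
    if minibatch_size is None:
        minibatches_per_epoch = 1
    else:
        minibatches_per_epoch = rollout_size // minibatch_size
    return n_rollouts * n_epochs * minibatches_per_epoch

budget = 500_000  # same environment-step budget for both algorithms

a2c_updates = updates_per_run(budget, n_envs=8, n_steps=256)
ppo_updates = updates_per_run(budget, n_envs=8, n_steps=256, n_epochs=10, minibatch_size=64)

print("A2C-style updates:", a2c_updates)
print("PPO-style updates:", ppo_updates)
```

Both runs consume the same number of environment steps; only the number of optimizer steps taken on each batch differs, which is where PPO's better sample use and its overfitting risk both come from.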
Stable-Baselines3 one-liner (reference)
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Newer Gymnasium releases register this env as "LunarLander-v3"; use the ID your install provides.
vec_env = make_vec_env("LunarLander-v2", n_envs=8)
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=500_000)

Swap PPO for A2C to compare learning curves fairly on the same vec env.
Checkpoint: Parallelism speeds up data collection, but it does not relax the Markov assumption; each env still needs proper resets and seeds. In short, A2C and PPO scale by batching many envs, and PPO adds clipping for safer reuse of each batch.
Common mistakes
- Confusing env steps with gradient steps: 8 envs × 256 steps = 2048 transitions per rollout, not 256.
- Reusing one seed for every env: use seed + rank so each env gets a different seed and rollouts stay diverse.
- Forgetting that vec envs auto-reset: the Gymnasium vec API handles terminal resets itself; log episode stats from infos.
- Huge n_envs on a CPU MLP: overhead dominates past ~16 envs on small nets.
- Comparing A2C to PPO with different total timesteps: match the environment interaction budget.
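The seed + rank rule from the list above can be illustrated without any RL library. In this sketch a standard-library `random.Random` stands in for one environment's internal randomness; `per_env_seeds` and `first_draws` are hypothetical helpers written for this example.

```python
import random

def per_env_seeds(base_seed, n_envs):
    """Give env number `rank` the seed base_seed + rank."""
    return [base_seed + rank for rank in range(n_envs)]

def first_draws(seed, n=3):
    """Stand-in for an env's randomness: the first n draws from its RNG."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

seeds = per_env_seeds(42, 4)
distinct = [first_draws(s) for s in seeds]    # one stream per env
identical = [first_draws(42) for _ in seeds]  # the mistake: one seed reused

print("seeds:", seeds)
print("all streams distinct:", len({tuple(d) for d in distinct}) == len(seeds))
print("reused seed gives copies:", all(d == identical[0] for d in identical))
```

With a reused seed, every env replays the same random stream, so eight envs contribute roughly the information of one. A list of per-env seeds like this can typically be passed as the `seed` argument when resetting a vector env, though the exact behavior is worth checking against the library version in use.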