A2C & parallel RL

Before we begin

Advantage Actor–Critic (A2C) is the synchronous counterpart of A3C: multiple workers collect experience in parallel, then a single optimizer step updates the shared weights. It is PPO's lighter cousin (no clipping, usually one pass over each batch), but the vectorized-environment pattern is identical and essential for throughput.

Learning objectives

Contrast A2C (synchronous) with A3C (asynchronous).
Use Gymnasium VectorEnv or SB3 SubprocVecEnv.
Aggregate rollouts from N workers into one GAE batch.
Explain wall-clock speedup vs sample efficiency tradeoffs.
Choose A2C vs PPO for a given project budget.

A2C update (no clip)

A2C uses the same actor–critic network and GAE advantages as PPO, but it typically optimizes the plain (unclipped) policy-gradient loss:

text

L = − log π(a|s) · A_t + c_1 (V − R)² − c_2 H(π)

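Here A_t is the GAE advantage (treated as a constant), R is the return target for the critic, H(π) is the policy entropy, and c_1, c_2 are weighting coefficients. A2C makes one pass over the data per iteration: simpler than PPO, but sometimes less stable on hard envs.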
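The loss above can be sketched in PyTorch as follows. This is a minimal illustration for a discrete action space, not a reference implementation: the function name `a2c_loss`, the default coefficients `c1=0.5` and `c2=0.01`, and the toy batch are all assumptions made for the example.

```python
import torch
from torch.distributions import Categorical


def a2c_loss(logits, values, actions, advantages, returns, c1=0.5, c2=0.01):
    """Unclipped A2C loss, averaged over a flat batch of B transitions.

    logits: (B, n_actions); values, actions, advantages, returns: (B,).
    c1 and c2 are assumed example values, not prescribed ones.
    """
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    # Advantages act as fixed weights: no gradient flows through them.
    policy_loss = -(log_probs * advantages.detach()).mean()
    # Critic regresses its value estimate toward the return target.
    value_loss = (values - returns.detach()).pow(2).mean()
    # Entropy bonus is subtracted, so higher entropy lowers the loss.
    entropy = dist.entropy().mean()
    return policy_loss + c1 * value_loss - c2 * entropy


# Toy batch with random numbers, only to show shapes and the backward pass.
torch.manual_seed(0)
logits = torch.randn(16, 2, requires_grad=True)
values = torch.randn(16, requires_grad=True)
actions = torch.randint(0, 2, (16,))
advantages = torch.randn(16)
returns = torch.randn(16)

loss = a2c_loss(logits, values, actions, advantages, returns)
loss.backward()
```

In a real training loop, `logits` and `values` would come from the shared network, and a single optimizer step would follow `loss.backward()`.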

Vectorized collection

python

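import gymnasium as gym

n_envs = 8
rollout_steps = 256  # steps per env per rollout
envs = gym.make_vec("CartPole-v1", num_envs=n_envs, vectorization_mode="sync")

obs, _ = envs.reset(seed=42)
# obs shape: (n_envs, obs_dim)

buffer = []  # stand-in for a real rollout buffer
for step in range(rollout_steps):
    # Stand-in for policy.act(obs): random actions, shape (n_envs,)
    actions = envs.action_space.sample()
    obs2, rewards, term, trunc, infos = envs.step(actions)
    dones = term | trunc  # episode ended, by termination or time limit
    buffer.append((obs, actions, rewards, obs2, dones))
    obs = obs2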

Parallel envs raise steps per second, not sample efficiency: the agent still needs roughly the same total number of environment steps to learn.
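Once the rollout is collected, the per-env trajectories are combined into one GAE batch. The sketch below assumes the buffer has been stacked into arrays of shape (T, n_envs); `compute_gae` and its argument names are hypothetical. For brevity it treats truncation like termination, which a more careful implementation would handle by bootstrapping from the value of the final observation.

```python
import numpy as np


def compute_gae(rewards, values, last_values, dones, gamma=0.99, lam=0.95):
    """GAE over a time-major rollout.

    rewards, values, dones: (T, n_envs); last_values: (n_envs,).
    dones[t] is 1 where the episode ended at step t.
    """
    dones = np.asarray(dones, dtype=np.float64)
    T = rewards.shape[0]
    advantages = np.zeros(rewards.shape, dtype=np.float64)
    next_values = np.asarray(last_values, dtype=np.float64)
    next_adv = np.zeros_like(next_values)
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD error; the bootstrap term is cut where the episode ended.
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_values = values[t]
    returns = advantages + values
    return advantages, returns


# Toy rollout: 3 steps, 2 envs, reward 1 everywhere, critic outputs 0.
T, n_envs = 3, 2
rewards = np.ones((T, n_envs))
values = np.zeros((T, n_envs))
last_values = np.zeros(n_envs)
dones = np.zeros((T, n_envs))
dones[1, 1] = 1.0  # env 1 finishes an episode at step 1

adv, ret = compute_gae(rewards, values, last_values, dones, gamma=1.0, lam=1.0)

# Flatten (T, n_envs) into one batch of T * n_envs transitions for the update.
flat_adv = adv.reshape(-1)
flat_ret = ret.reshape(-1)
```

The recursion runs over time only; the env dimension is handled by broadcasting, so adding workers does not change the code.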

Worked example — throughput

Setup	Steps/sec (illustrative)	Time to 1M steps
1 env	2,000	~8.3 min
8 envs	12,000	~1.4 min
32 envs (GPU policy)	40,000+	~25 sec

Returns diminish once the policy forward pass or the environment simulation becomes the bottleneck.
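The table's arithmetic, and the diminishing returns it hints at, can be checked with a few lines. The helper names are hypothetical, the inputs are the table's illustrative numbers rather than measurements, and the 32-env row also switches to a GPU policy, so its efficiency figure is only a rough comparison.

```python
def time_to_target(steps_per_sec, total_steps=1_000_000):
    """Seconds needed to collect total_steps at a given throughput."""
    return total_steps / steps_per_sec


def parallel_efficiency(steps_per_sec, single_env_sps, n_envs):
    """Fraction of ideal linear speedup actually achieved."""
    return steps_per_sec / (single_env_sps * n_envs)


# Illustrative throughputs taken from the table above.
setups = [("1 env", 1, 2_000), ("8 envs", 8, 12_000), ("32 envs", 32, 40_000)]
for label, n, sps in setups:
    seconds = time_to_target(sps)
    eff = parallel_efficiency(sps, 2_000, n)
    print(f"{label}: {seconds:.0f} s ({seconds / 60:.1f} min), efficiency {eff:.0%}")
```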

A2C vs A3C vs PPO

Algorithm	Parallelism	Stability	Typical use
A3C	Async workers + stale grads	Noisy	Legacy (once popular)
A2C	Sync batch update	Moderate	Baseline parallel
PPO	Sync + clip + multi-epoch	Strong	Default

Modern libraries favor synchronous updates: GPUs handle one large batched update better than many small asynchronous ones.

Shared network batch forward

python

import torch
 
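def act_batch(policy, obs_batch):
    """obs_batch: (n_envs, obs_dim) NumPy array from the vector env."""
    # Run the forward pass on whatever device the policy lives on.
    device = next(policy.parameters()).device
    obs_t = torch.as_tensor(obs_batch, dtype=torch.float32, device=device)
    with torch.no_grad():  # collection only, no gradients needed
        dist = policy(obs_t)
        actions = dist.sample()
        log_probs = dist.log_prob(actions)
        values = policy.value_head(obs_t)
    # Move results to CPU first; .numpy() fails on GPU tensors.
    return actions.cpu().numpy(), log_probs.cpu().numpy(), values.cpu().numpy()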

Batching amortizes GPU kernel launch overhead.

Sample efficiency vs wall-clock

More envs → faster wall-clock time to reach N total steps.
PPO multi-epoch → more gradient updates per env step (better sample use, at the risk of overfitting to a batch).
A2C → one update per rollout (faster iterations, but may need more env steps).
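The difference in gradient updates per rollout is easy to quantify. The helper below is a hypothetical sketch; the PPO settings used (10 epochs, minibatches of 64) are assumed example values, and the rollout size of 8 envs × 256 steps is chosen only for illustration.

```python
def updates_per_rollout(n_envs, n_steps, n_epochs=1, minibatch_size=None):
    """Gradient updates performed on one rollout of n_envs * n_steps transitions."""
    batch = n_envs * n_steps
    if minibatch_size is None:
        minibatch_size = batch  # full-batch update, as in plain A2C
    return n_epochs * (batch // minibatch_size)


n_envs, n_steps = 8, 256
transitions = n_envs * n_steps

a2c_updates = updates_per_rollout(n_envs, n_steps)
ppo_updates = updates_per_rollout(n_envs, n_steps, n_epochs=10, minibatch_size=64)

print(f"transitions per rollout: {transitions}")
print(f"A2C updates per rollout: {a2c_updates}")
print(f"PPO updates per rollout: {ppo_updates}")
```

With these settings PPO extracts many more updates from the same data, which is where both its better sample use and its overfitting risk come from.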

For the Lunar Lander project, PPO with 4–8 parallel envs is a sweet spot on a laptop CPU.

Stable-Baselines3 one-liner (reference)

python

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
 
# On Gymnasium >= 1.0 the registered ID is "LunarLander-v3"; "v2" applies to older releases.
vec_env = make_vec_env("LunarLander-v2", n_envs=8)
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=500_000)

Swap PPO for A2C, keeping the same vec env and total timesteps, to compare learning curves fairly.

Checkpoint: Parallelism buys wall-clock speed and more data per second; it does not relax the Markov assumption, so each env still needs proper resets and its own seed. Summary: A2C and PPO both scale by batching many envs; PPO adds clipping so that each batch can be reused more safely.

Common mistakes

Miscounting steps: 8 envs × 256 steps per env = 2048 transitions per rollout, not 256, and neither number is the count of gradient steps.
Reusing one seed for every env: identical seeds give every env the same initial states and noise; use seed + rank for diversity.
Resetting vec envs by hand: the Gymnasium vector API auto-resets sub-envs at episode end; read episode statistics from infos instead.
Huge n_envs with a small MLP on CPU: overhead tends to dominate past roughly 16 envs.
Comparing A2C to PPO with different total timesteps: match the environment-interaction budget.
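The seed + rank rule can be illustrated without any RL library. In this sketch a NumPy generator stands in for each env's internal randomness, which is an assumption made to keep the example self-contained; the variable names are hypothetical.

```python
import numpy as np

base_seed = 42
n_envs = 8

# One distinct seed per worker: base seed plus the worker's rank.
env_seeds = [base_seed + rank for rank in range(n_envs)]

# A NumPy generator stands in for each env's random number stream.
diverse_draws = [np.random.default_rng(s).random() for s in env_seeds]

# Same seed everywhere: every "env" produces the same first draw.
duplicate_draws = [np.random.default_rng(base_seed).random() for _ in range(n_envs)]

print("distinct first draws with seed + rank:", len(set(diverse_draws)))
print("distinct first draws with one seed:   ", len(set(duplicate_draws)))
```

With one shared seed, all eight workers would start from the same state and add little diversity to the batch.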
