
Module 6 — Actor–critic & PPO

A2C & parallel RL

Synchronous workers, vectorized envs, and throughput vs sample efficiency.

~55 min read + exercises

Before we begin

Advantage Actor–Critic (A2C) is synchronous parallel RL: multiple workers collect experience simultaneously, then a single optimizer updates shared weights. It is PPO's lighter cousin — no clip, often one epoch per batch — but the vectorized environment pattern is identical and essential for throughput.


Learning objectives

  • Contrast A2C (synchronous) with A3C (asynchronous).
  • Use Gymnasium VectorEnv or SB3 SubprocVecEnv.
  • Aggregate rollouts from N workers into one GAE batch.
  • Explain wall-clock speedup vs sample efficiency tradeoffs.
  • Choose A2C vs PPO for a given project budget.

A2C update (no clip)

Same actor–critic + GAE as PPO, but typically:

text
L = − log π(a_t|s_t) · A_t + c_1 (V(s_t) − R_t)² − c_2 H(π(·|s_t))

A2C makes one pass over the data per iteration: simpler than PPO, but sometimes less stable on hard envs.
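The loss above can be sketched in PyTorch for a discrete-action policy. This is a minimal illustration, not a reference implementation: the function name `a2c_loss`, the coefficient defaults `c1=0.5` and `c2=0.01`, and the choice to average over the batch are all assumptions made for the example.

```python
import torch
from torch.distributions import Categorical


def a2c_loss(logits, values, actions, advantages, returns, c1=0.5, c2=0.01):
    """Scalar A2C loss for a flat batch of B transitions (discrete actions).

    logits: (B, n_actions), values: (B,), actions: (B,) int64,
    advantages: (B,), returns: (B,). c1 and c2 are assumed defaults.
    """
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    # Advantages are treated as constants, so the policy term does not
    # push gradients into the critic.
    policy_loss = -(log_probs * advantages.detach()).mean()
    value_loss = (values - returns.detach()).pow(2).mean()
    entropy = dist.entropy().mean()
    return policy_loss + c1 * value_loss - c2 * entropy


# Smoke run on random data: 6 transitions, 2 actions.
torch.manual_seed(0)
logits = torch.randn(6, 2, requires_grad=True)
values = torch.randn(6, requires_grad=True)
actions = torch.randint(0, 2, (6,))
loss = a2c_loss(logits, values, actions, torch.randn(6), torch.randn(6))
loss.backward()  # populates logits.grad and values.grad
```

There is no probability ratio and no clipping here, which is the main difference from the PPO objective.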

Vectorized collection

python
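import gymnasium as gym

n_envs = 8
rollout_steps = 256  # steps per env per rollout (illustrative value)
envs = gym.make_vec("CartPole-v1", num_envs=n_envs, vectorization_mode="sync")

obs, _ = envs.reset(seed=42)
# obs shape: (n_envs, obs_dim)

# `policy` and `buffer` are placeholders for your own actor and rollout storage.
for step in range(rollout_steps):
    actions = policy.act(obs)  # (n_envs,)
    obs2, rewards, term, trunc, infos = envs.step(actions)
    dones = term | trunc  # episode ended by termination or truncation
    buffer.store(obs, actions, rewards, obs2, dones)
    obs = obs2  # finished sub-envs are auto-reset by the vector env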

Parallel envs improve steps per second, not sample efficiency per step: you still need roughly the same total number of environment steps to learn.
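A rollout collected this way has shape (T, N): T steps from each of N envs. One way to turn it into a single GAE batch is sketched below with NumPy. The function name `gae_batch` and the defaults `gamma=0.99`, `lam=0.95` are assumptions for illustration. The sketch also treats truncation like termination (as `dones = term | trunc` does), which is a simplification, because it drops the bootstrap value at time-limit cutoffs.

```python
import numpy as np


def gae_batch(rewards, values, dones, last_values, gamma=0.99, lam=0.95):
    """GAE over a vectorized rollout.

    rewards, values, dones: arrays of shape (T, N); dones[t] is True where
    the episode in that env ended at step t. last_values: shape (N,), the
    critic's estimate for the observation after the final step.
    Returns flat (T * N,) advantages and returns.
    """
    T, N = rewards.shape
    advantages = np.zeros((T, N), dtype=np.float64)
    next_values = last_values
    running = np.zeros(N)
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t].astype(np.float64)
        # One-step TD error; bootstrapping is cut where the episode ended.
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_values = values[t]
    returns = advantages + values
    # Flatten time and env axes into one batch for the optimizer.
    return advantages.reshape(T * N), returns.reshape(T * N)


T, N = 3, 2
adv, ret = gae_batch(
    rewards=np.ones((T, N)),
    values=np.zeros((T, N)),
    dones=np.zeros((T, N), dtype=bool),
    last_values=np.zeros(N),
    gamma=1.0,
    lam=1.0,
)
print(adv)  # [3. 3. 2. 2. 1. 1.]
```

With `gamma = lam = 1` and a zero critic, the advantages reduce to the reward-to-go, which makes the recursion easy to check by hand.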

Worked example — throughput

| Setup | Steps/sec (illustrative) | Time to 1M steps |
|---|---|---|
| 1 env | 2,000 | ~8.3 min |
| 8 envs | 12,000 | ~1.4 min |
| 32 envs (GPU policy) | 40,000+ | ~25 sec |

Returns diminish once the policy forward pass or the environment simulation becomes the bottleneck.
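The arithmetic behind the table, and the diminishing returns it shows, can be reproduced with a few lines of Python. The helper names below are hypothetical, and the steps/sec figures are the illustrative ones from the table, not measurements.

```python
def seconds_to_reach(total_steps, steps_per_sec):
    """Wall-clock seconds needed to collect total_steps transitions."""
    return total_steps / steps_per_sec


def parallel_efficiency(base_sps, sps, n_envs):
    """Fraction of ideal linear speedup actually achieved."""
    return (sps / base_sps) / n_envs


# n_envs -> illustrative steps/sec (the 32-env figure assumes a GPU policy,
# so its efficiency is not a like-for-like comparison with the 1-env row).
setups = {1: 2_000, 8: 12_000, 32: 40_000}
for n_envs, sps in setups.items():
    secs = seconds_to_reach(1_000_000, sps)
    eff = parallel_efficiency(setups[1], sps, n_envs)
    print(f"{n_envs:>2} envs: {secs:6.1f} s, efficiency {eff:.2f}")
```

An efficiency below 1.0 means each added env contributes less than a full single-env worth of throughput.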

A2C vs A3C vs PPO

| Algorithm | Parallelism | Stability | Typical use |
|---|---|---|---|
| A3C | Async workers + stale grads | Noisy but once popular | Legacy |
| A2C | Sync batch update | Moderate | Baseline parallel |
| PPO | Sync + clip + multi-epoch | Strong | Default |

Modern libraries favor synchronous updates: GPUs are most efficient on large batched tensors, whereas asynchronous workers each push small, possibly stale gradient updates.

Shared network batch forward

python
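import torch

def act_batch(policy, obs_batch):
    """One forward pass for all envs. obs_batch: (n_envs, obs_dim)"""
    # Assumes `policy` is an nn.Module that returns a torch distribution
    # and exposes a `value_head`; both are placeholders for your own model.
    device = next(policy.parameters()).device
    obs_t = torch.as_tensor(obs_batch, dtype=torch.float32, device=device)
    with torch.no_grad():
        dist = policy(obs_t)
        actions = dist.sample()
        log_probs = dist.log_prob(actions)
        values = policy.value_head(obs_t)
    # .cpu() is required before .numpy() when the policy lives on a GPU
    return actions.cpu().numpy(), log_probs.cpu().numpy(), values.cpu().numpy()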

Batching amortizes GPU kernel launch overhead.

Sample efficiency vs wall-clock

  • More envs → faster wall-clock to N total steps.
  • PPO multi-epoch → more gradient updates per env step (better sample use, risk overfit).
  • A2C → 1 update per rollout (faster iterations, may need more env steps).
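The difference in gradient updates per rollout is easy to quantify. The sketch below uses a hypothetical helper, `updates_per_rollout`, and illustrative PPO settings (10 epochs, minibatches of 64); your own configuration may differ.

```python
def updates_per_rollout(n_envs, n_steps, n_epochs=1, minibatch_size=None):
    """Return (transitions, gradient_updates) for one rollout."""
    transitions = n_envs * n_steps
    if minibatch_size is None:
        # A2C-style: one full-batch update per epoch (one epoch by default).
        return transitions, n_epochs
    minibatches = -(-transitions // minibatch_size)  # ceiling division
    return transitions, n_epochs * minibatches


a2c = updates_per_rollout(n_envs=8, n_steps=256)
ppo = updates_per_rollout(n_envs=8, n_steps=256, n_epochs=10, minibatch_size=64)
print(a2c)  # (2048, 1)
print(ppo)  # (2048, 320)
```

Both algorithms consume the same 2048 transitions; PPO simply extracts many more optimizer steps from them, which is why it needs the clip to stay stable.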

For the Lunar Lander project, PPO with 4–8 parallel envs is a sweet spot on a laptop CPU.

Stable-Baselines3 one-liner (reference)

python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
 
# Requires the Box2D extra; on Gymnasium >= 1.0 the ID is "LunarLander-v3".
vec_env = make_vec_env("LunarLander-v2", n_envs=8)
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=500_000)

Swap PPO for A2C, keeping the same vec env and total_timesteps, to compare learning curves fairly.

Checkpoint: Parallelism addresses wall-clock speed and the need for large amounts of data, not the Markov assumptions; each env still needs proper resets and seeds. In summary, A2C and PPO scale by batching many envs, and PPO adds clipping for safer reuse of each batch.

Common mistakes

  1. Miscounting env steps: 8 envs × 256 steps per env = 2048 transitions per rollout, not 256. Gradient steps are counted separately.
  2. Identical seeds in every env: use seed + rank so each env produces different trajectories.
  3. Manually resetting vec envs: the Gymnasium vector API auto-resets finished sub-envs, so do not call reset() yourself mid-rollout; log episode stats from infos.
  4. Huge n_envs on CPU MLP — overhead dominates past ~16 envs on small nets.
  5. Comparing A2C to PPO with different total timesteps — match environment interaction budget.
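The seed + rank convention from mistake 2 can be sketched as a small helper. The name `per_env_seeds` is hypothetical, and the note on passing the list to a vector env's reset reflects the Gymnasium vector API as an assumption to verify against your installed version.

```python
def per_env_seeds(base_seed, n_envs):
    """One distinct seed per sub-environment: base_seed + rank."""
    return [base_seed + rank for rank in range(n_envs)]


seeds = per_env_seeds(42, 8)
print(seeds)  # [42, 43, 44, 45, 46, 47, 48, 49]

# Assumption: a Gymnasium vector env accepts this list as
# envs.reset(seed=seeds), giving each sub-env its own seed.
```

Keeping the base seed in your experiment config and deriving the rest makes runs reproducible while still giving each env a different trajectory.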
