Why learn policies directly
Before we begin
Value-based methods like DQN learn Q(s,a), then derive a policy via argmax. Policy gradient methods parameterize π(a|s; θ) directly and optimize expected return with gradient ascent. That matters when actions are continuous and when the best policy is stochastic rather than deterministic.
Learning objectives
- Contrast value-based vs policy-based vs actor–critic approaches.
- State when argmax Q is insufficient (continuous actions, large discrete spaces).
- Write the policy objective J(θ) = E_π [sum of discounted rewards].
- Recognize stochastic policies as a built-in exploration mechanism.
- Map policy outputs to Gymnasium action spaces (discrete softmax, continuous Gaussian).
Three families of RL algorithms
| Family | Learns | Policy extraction | Typical use |
|---|---|---|---|
| Value-based | Q(s,a) or V(s) | ε-greedy or argmax | Discrete actions, Atari |
| Policy-based | π(a\|s; θ) | Direct sampling | Continuous actions, stochastic policies (REINFORCE) |
| Actor–critic | π and V or Q | Sample from π, critic reduces variance | PPO, SAC, modern default |
DQN cannot output a torque of −0.37 N·m unless the action range is discretized into hundreds of bins. A policy network instead outputs the parameters of a continuous distribution and samples the action from it.
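To make the discretization cost concrete, here is a minimal sketch that snaps the torque from the text to the nearest of `n_bins` evenly spaced values. The helper `nearest_bin` is hypothetical, and the bounds of ±2.0 are an assumption taken from Pendulum-v1's torque range; other environments will differ.

```python
def nearest_bin(value, low, high, n_bins):
    """Return the evenly spaced bin value closest to `value`."""
    step = (high - low) / (n_bins - 1)
    index = round((value - low) / step)
    index = max(0, min(n_bins - 1, index))  # clamp to the valid bin range
    return low + index * step


target = -0.37          # desired torque from the text
low, high = -2.0, 2.0   # assumed bounds (Pendulum-v1 torque range)

for n_bins in (5, 11, 401):
    chosen = nearest_bin(target, low, high, n_bins)
    print(f"{n_bins:>3} bins -> {chosen:+.2f} (error {abs(chosen - target):.2f})")
```

With only a handful of bins the closest available torque is far from the target, and the error shrinks only as the number of discrete actions grows, which also grows the Q-network's output layer.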
Policy parameterization examples
Discrete (CartPole): softmax logits → categorical distribution.
```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class PolicyDiscrete(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        # Raw logits; Categorical applies the softmax internally.
        logits = self.net(obs)
        return Categorical(logits=logits)


# usage
policy = PolicyDiscrete(obs_dim=4, n_actions=2)  # CartPole: 4 observations, 2 actions
obs = torch.randn(4)
dist = policy(obs)
action = dist.sample()
log_prob = dist.log_prob(action)
```

Continuous (Pendulum): Gaussian mean and learned log_std.
```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class PolicyContinuous(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )
        # State-independent log standard deviation, initialized to 0 (std = 1).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, obs):
        mean = self.net(obs)
        std = self.log_std.exp()
        return Normal(mean, std)
```

Worked example: why stochastic policies help
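Consider rock–paper–scissors against an opponent who always plays the move that beats your previous one. A deterministic policy that repeats one move is fully exploitable: the opponent counters it every round. A mixed policy (each action with probability 1/3) has expected payoff zero whatever the opponent plays, so it is unexploitable in expectation. Even in MDPs whose optimal policy is deterministic, sampling from a stochastic policy during training provides exploration without bolting ε-greedy onto a separate value function.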
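The rock–paper–scissors argument can be checked with a short simulation. This is a sketch under stated assumptions: both policies are memoryless, the opponent always plays the counter to our previous move, and the names `average_payoff`, `always_rock`, and `uniform_mix` are hypothetical helpers, not part of any library.

```python
import random

# For each move (key), the move that beats it (value).
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}
MOVES = list(BEATS)


def payoff(mine, theirs):
    """+1 if `mine` wins, -1 if it loses, 0 on a tie."""
    if mine == theirs:
        return 0
    return 1 if BEATS[theirs] == mine else -1


def average_payoff(policy, rounds=30_000, seed=0):
    """Play against an opponent that counters our previous move."""
    rng = random.Random(seed)
    last_move = rng.choice(MOVES)  # arbitrary "previous move" before round 1
    total = 0
    for _ in range(rounds):
        opponent = BEATS[last_move]  # opponent counters what we played last
        move = policy(rng)
        total += payoff(move, opponent)
        last_move = move
    return total / rounds


def always_rock(rng):
    """Deterministic policy: same move every round."""
    return "rock"


def uniform_mix(rng):
    """Mixed policy: each move with probability 1/3."""
    return rng.choice(MOVES)


print("deterministic:", average_payoff(always_rock))
print("uniform mix:  ", average_payoff(uniform_mix))
```

The deterministic policy loses essentially every round, while the uniform mix averages close to zero.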
Objective function
Maximize expected discounted return:
J(θ) = E_{τ ~ π_θ} [ G_0 ]
where the trajectory τ is the sequence of states, actions, and rewards generated by running π_θ in the environment, and G_0 is the discounted return from its first step. No argmax is involved: gradients flow through log π(a|s; θ) (covered in the next lesson).
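In practice the expectation in J(θ) is usually estimated by averaging G_0 over sampled episodes. The sketch below assumes rewards are indexed from the first step of the episode and uses made-up reward sequences; `discounted_return` and `estimate_objective` are hypothetical helper names.

```python
def discounted_return(rewards, gamma=0.99):
    """G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last step backwards
        g = r + gamma * g
    return g


def estimate_objective(episodes, gamma=0.99):
    """Monte Carlo estimate of J(theta): mean G_0 over sampled episodes."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Hypothetical reward sequences from three episodes run under the same policy.
episodes = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0],
    [1.0],
]
print(estimate_objective(episodes, gamma=0.5))
```

With γ = 0.5 the three returns are 1.75, 1.5, and 1.0, so the estimate is their mean. More episodes reduce the variance of the estimate but do not change what it targets.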
When to prefer policy gradients
| Situation | Value-based | Policy gradient |
|---|---|---|
| Discrete, moderate actions | Strong (DQN) | Works (REINFORCE) |
| Continuous actions | Weak without discretization | Natural |
| Stochastic optimal policy | Suboptimal if forced deterministic | Natural |
| High-dimensional action | argmax expensive | Factorized distributions |
| Need policy entropy / safety | Boltzmann on Q only | Direct entropy bonus |
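The entropy mentioned in the last row is cheap to compute for a discrete policy. The following is a minimal sketch in plain Python, with hypothetical helpers `softmax` and `entropy`, that compares a uniform policy with a nearly deterministic one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = softmax([0.0, 0.0, 0.0])   # zero logits give a uniform policy
peaked = softmax([10.0, 0.0, 0.0])   # large logit gap: nearly deterministic

print(entropy(uniform), math.log(3))  # uniform policy reaches the maximum, ln(3)
print(entropy(peaked))                # close to zero
```

An entropy bonus adds a term β·H(π(·|s)) to the objective so that the optimizer is rewarded for keeping the policy spread out; β is a coefficient chosen by the practitioner and is an assumption here, not a value given in this lesson.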
Checkpoint: If your action space is Box in Gymnasium, start with policy gradients or actor–critic, not DQN.
Summary: Learn π directly when the policy itself is the object you need or when argmax Q is awkward.
Common mistakes
- Using DQN on continuous Box actions: it only works after discretization, and coarse bins destroy control quality.
- Forgetting to squash continuous actions to the environment's bounds: apply tanh and rescale to the action space's low/high.
- Starting from a near-deterministic policy: zero logits give a uniform policy, but badly scaled initial logits can give a degenerate one, so check the entropy at initialization.
- Confusing the sign of the policy loss: we maximize return, so do gradient ascent on J (or minimize −J).
- Ignoring action masking: invalid moves in games need a masked softmax, not raw logits.
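Two of the fixes above, tanh scaling and masked softmax, can be sketched in a few lines of plain Python. The function names `squash_to_bounds` and `masked_softmax` are hypothetical, and the sketch assumes scalar actions and finite logits; a real implementation would operate on tensors.

```python
import math


def squash_to_bounds(raw_action, low, high):
    """Map an unbounded sample into [low, high] with tanh scaling."""
    unit = math.tanh(raw_action)  # now in [-1, 1]
    return low + 0.5 * (unit + 1.0) * (high - low)


def masked_softmax(logits, valid):
    """Softmax that assigns probability 0 to actions where `valid` is False."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# A raw Gaussian sample of 5.0 stays inside the assumed bounds of [-2, 2].
print(squash_to_bounds(5.0, low=-2.0, high=2.0))

# The second action is invalid, so it receives zero probability.
print(masked_softmax([1.0, 2.0, 3.0], valid=[True, False, True]))
```

If the log-probability of the squashed action is needed for the gradient, the Gaussian log-probability has to be corrected for the tanh change of variables; this sketch only covers the forward mapping.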