Returns, discounting & episodes
Before we begin
Algorithms do not maximize a single reward — they maximize return: cumulative reward over time, possibly discounted. Discounting with γ answers two needs: keep infinite sums finite in continuing tasks, and model that near-term outcomes often matter more than the distant future.
You will compute returns by hand, distinguish episodic from continuing tasks, and see why Monte Carlo methods wait for episode boundaries.
What you will learn
- Define return Gₜ for finite and infinite horizons.
- Compute discounted returns with γ ∈ (0, 1).
- Distinguish episodic vs continuing tasks.
- Explain why γ < 1 stabilizes infinite-horizon objectives.
- Relate return to what policy gradient and value-based methods optimize.
Finite-horizon return
For an episode that ends at time T:
Gₜ = rₜ₊₁ + rₜ₊₂ + … + r_T
Example: rewards (+1, +1, +1, +10) starting at t=0:
G₀ = 1 + 1 + 1 + 10 = 13
No discounting — every step counts equally. Common in episodic games with clear termination.
Discounted return
For infinite or continuing horizons:
Gₜ = rₜ₊₁ + γ rₜ₊₂ + γ² rₜ₊₃ + …
| γ | Interpretation |
|---|---|
| 0 | Myopic — only next reward matters |
| 0.9 | Long horizon, mild discount |
| 0.99 | Very far-sighted |
| 1 | Undiscounted (needs finite episodes or average reward) |
Worked numeric example
Rewards: (+1, +1, +1, …) forever, γ = 0.9.
G₀ = 1 + 0.9 + 0.9² + 0.9³ + … = 1 / (1 − 0.9) = 10
def discounted_return(rewards, gamma=0.99):
G = 0.0
power = 1.0
for r in rewards:
G += power * r
power *= gamma
return G
print(discounted_return([1, 1, 1, 1], gamma=0.9)) # 3.439
print(discounted_return([1] * 100, gamma=0.9)) # ~9.99 (approaches 10)Checkpoint: Same reward sequence, γ = 0 vs γ = 0.9 — which G₀ is larger?
Answer
γ = 0.9 gives G₀ = r₁ = 1 only if we define Gₜ starting at first reward… For full sequence starting t=0 with rewards (1,1,1): γ=0 → G=1; γ=0.9 → G=1+0.9+0.81=2.71. Higher γ weights later +1s more.
Episodic vs continuing tasks
| Type | Episode end | Examples |
|---|---|---|
| Episodic | Terminal state or time limit | Chess, CartPole, maze exit |
| Continuing | No natural end | Portfolio control, server autoscaling |
For episodic tasks, return is usually computed until termination. For continuing tasks, γ < 1 is standard so Gₜ stays finite.
Bootstrapping preview
Some methods estimate Gₜ from partial trajectories plus a value estimate of the tail — temporal-difference learning (Module 2). Monte Carlo uses full episode returns.
Reward shaping and scale
Returns depend on reward design:
| Design choice | Effect |
|---|---|
| −1 per step in maze | Encourages shortest path |
| +0.01 per step alive | Encourages survival (may never finish) |
| Sparse +1 at goal only | Hard exploration |
Scaling all rewards by c scales returns by c — relative comparisons stay the same, but learning rates may need tuning with function approximators.
Computing return backward (useful trick)
Gₜ = rₜ₊₁ + γ Gₜ₊₁
If you know rewards (1, 1, 10) from t=0 to T=2:
- G₂ = 10
- G₁ = 1 + γ·10
- G₀ = 1 + γ·G₁
This recurrence connects directly to Bellman equations in the next lesson.
Common mistakes
- Using γ > 1 (blows up returns unless rewards decay).
- Mixing undiscounted reporting with discounted training objective.
- Ignoring truncation — time-limited CartPole still gets a valid return, but “success” may mean surviving 500 steps, not true terminal.
- Treating average reward criteria as identical to discounted return (advanced topic; different theory).
Return is the number every RL algorithm chases. With MDPs and returns defined, we can write value functions and the Bellman equations that make dynamic programming and Q-learning possible.