Returns, discounting & episodes

Before we begin

Algorithms do not maximize a single reward — they maximize return: cumulative reward over time, possibly discounted. Discounting with γ answers two needs: keep infinite sums finite in continuing tasks, and model that near-term outcomes often matter more than the distant future.

You will compute returns by hand, distinguish episodic from continuing tasks, and see why Monte Carlo methods wait for episode boundaries.

What you will learn

Define return Gₜ for finite and infinite horizons.
Compute discounted returns with γ ∈ (0, 1).
Distinguish episodic vs continuing tasks.
Explain why γ < 1 stabilizes infinite-horizon objectives.
Relate return to what policy gradient and value-based methods optimize.

Finite-horizon return

For an episode that ends at time T:

Gₜ = rₜ₊₁ + rₜ₊₂ + … + r_T

Example: rewards (+1, +1, +1, +10) starting at t=0:

G₀ = 1 + 1 + 1 + 10 = 13

No discounting — every step counts equally. Common in episodic games with clear termination.

Discounted return

For infinite or continuing horizons:

Gₜ = rₜ₊₁ + γ rₜ₊₂ + γ² rₜ₊₃ + …

γ	Interpretation
0	Myopic — only next reward matters
0.9	Long horizon, mild discount
0.99	Very far-sighted
1	Undiscounted (needs finite episodes or average reward)

Worked numeric example

Rewards: (+1, +1, +1, …) forever, γ = 0.9.

G₀ = 1 + 0.9 + 0.9² + 0.9³ + … = 1 / (1 − 0.9) = 10

python

def discounted_return(rewards, gamma=0.99):
    G = 0.0
    power = 1.0
    for r in rewards:
        G += power * r
        power *= gamma
    return G
 
print(discounted_return([1, 1, 1, 1], gamma=0.9))  # 3.439
print(discounted_return([1] * 100, gamma=0.9))     # ~9.99 (approaches 10)

Checkpoint: Same reward sequence, γ = 0 vs γ = 0.9 — which G₀ is larger?

Answer

γ = 0.9 gives G₀ = r₁ = 1 only if we define Gₜ starting at first reward… For full sequence starting t=0 with rewards (1,1,1): γ=0 → G=1; γ=0.9 → G=1+0.9+0.81=2.71. Higher γ weights later +1s more.

Episodic vs continuing tasks

Type	Episode end	Examples
Episodic	Terminal state or time limit	Chess, CartPole, maze exit
Continuing	No natural end	Portfolio control, server autoscaling

For episodic tasks, return is usually computed until termination. For continuing tasks, γ < 1 is standard so Gₜ stays finite.

Bootstrapping preview

Some methods estimate Gₜ from partial trajectories plus a value estimate of the tail — temporal-difference learning (Module 2). Monte Carlo uses full episode returns.

Reward shaping and scale

Returns depend on reward design:

Design choice	Effect
−1 per step in maze	Encourages shortest path
+0.01 per step alive	Encourages survival (may never finish)
Sparse +1 at goal only	Hard exploration

Scaling all rewards by c scales returns by c — relative comparisons stay the same, but learning rates may need tuning with function approximators.

Computing return backward (useful trick)

Gₜ = rₜ₊₁ + γ Gₜ₊₁

If you know rewards (1, 1, 10) from t=0 to T=2:

G₂ = 10
G₁ = 1 + γ·10
G₀ = 1 + γ·G₁

This recurrence connects directly to Bellman equations in the next lesson.

Common mistakes

Using γ > 1 (blows up returns unless rewards decay).
Mixing undiscounted reporting with discounted training objective.
Ignoring truncation — time-limited CartPole still gets a valid return, but “success” may mean surviving 500 steps, not true terminal.
Treating average reward criteria as identical to discounted return (advanced topic; different theory).

Return is the number every RL algorithm chases. With MDPs and returns defined, we can write value functions and the Bellman equations that make dynamic programming and Q-learning possible.

Before this lesson

Previous lesson

What's next

Next lesson — Bellman equations & value functions