← Back to curriculum

Module 1 — RL foundations & MDPs

Returns, discounting & episodes

Discounted return, episodic vs continuing tasks, and numeric worked examples.

~55 min read + exercises

Returns, discounting & episodes

Before we begin

Algorithms do not maximize a single reward — they maximize return: cumulative reward over time, possibly discounted. Discounting with γ answers two needs: keep infinite sums finite in continuing tasks, and model that near-term outcomes often matter more than the distant future.

You will compute returns by hand, distinguish episodic from continuing tasks, and see why Monte Carlo methods wait for episode boundaries.


What you will learn

  • Define return Gₜ for finite and infinite horizons.
  • Compute discounted returns with γ ∈ (0, 1).
  • Distinguish episodic vs continuing tasks.
  • Explain why γ < 1 stabilizes infinite-horizon objectives.
  • Relate return to what policy gradient and value-based methods optimize.

Finite-horizon return

For an episode that ends at time T:

Gₜ = rₜ₊₁ + rₜ₊₂ + … + r_T

Example: rewards (+1, +1, +1, +10) starting at t=0:

G₀ = 1 + 1 + 1 + 10 = 13

No discounting — every step counts equally. Common in episodic games with clear termination.


Discounted return

For infinite or continuing horizons:

Gₜ = rₜ₊₁ + γ rₜ₊₂ + γ² rₜ₊₃ + …

γInterpretation
0Myopic — only next reward matters
0.9Long horizon, mild discount
0.99Very far-sighted
1Undiscounted (needs finite episodes or average reward)

Worked numeric example

Rewards: (+1, +1, +1, …) forever, γ = 0.9.

G₀ = 1 + 0.9 + 0.9² + 0.9³ + … = 1 / (1 − 0.9) = 10

python
def discounted_return(rewards, gamma=0.99):
    G = 0.0
    power = 1.0
    for r in rewards:
        G += power * r
        power *= gamma
    return G
 
print(discounted_return([1, 1, 1, 1], gamma=0.9))  # 3.439
print(discounted_return([1] * 100, gamma=0.9))     # ~9.99 (approaches 10)

Checkpoint: Same reward sequence, γ = 0 vs γ = 0.9 — which G₀ is larger?

Answer

γ = 0.9 gives G₀ = r₁ = 1 only if we define Gₜ starting at first reward… For full sequence starting t=0 with rewards (1,1,1): γ=0 → G=1; γ=0.9 → G=1+0.9+0.81=2.71. Higher γ weights later +1s more.


Episodic vs continuing tasks

TypeEpisode endExamples
EpisodicTerminal state or time limitChess, CartPole, maze exit
ContinuingNo natural endPortfolio control, server autoscaling

For episodic tasks, return is usually computed until termination. For continuing tasks, γ < 1 is standard so Gₜ stays finite.

Bootstrapping preview

Some methods estimate Gₜ from partial trajectories plus a value estimate of the tail — temporal-difference learning (Module 2). Monte Carlo uses full episode returns.


Reward shaping and scale

Returns depend on reward design:

Design choiceEffect
−1 per step in mazeEncourages shortest path
+0.01 per step aliveEncourages survival (may never finish)
Sparse +1 at goal onlyHard exploration

Scaling all rewards by c scales returns by c — relative comparisons stay the same, but learning rates may need tuning with function approximators.


Computing return backward (useful trick)

Gₜ = rₜ₊₁ + γ Gₜ₊₁

If you know rewards (1, 1, 10) from t=0 to T=2:

  • G₂ = 10
  • G₁ = 1 + γ·10
  • G₀ = 1 + γ·G₁

This recurrence connects directly to Bellman equations in the next lesson.


Common mistakes

  • Using γ > 1 (blows up returns unless rewards decay).
  • Mixing undiscounted reporting with discounted training objective.
  • Ignoring truncation — time-limited CartPole still gets a valid return, but “success” may mean surviving 500 steps, not true terminal.
  • Treating average reward criteria as identical to discounted return (advanced topic; different theory).

Return is the number every RL algorithm chases. With MDPs and returns defined, we can write value functions and the Bellman equations that make dynamic programming and Q-learning possible.


Before this lesson


What's next