RNNs — sequences and hidden state

Before we begin

Reviews, stock prices, and audio frames are sequences — order matters. Recurrent Neural Networks (RNNs) process one time step at a time and carry a hidden state forward.

“Great” then “not” means something different than “not” then “great”. RNNs model that order.

Figure: Unrolled RNN. The same weights are applied at each step; the hidden state h carries context forward.

What you will learn

Define hidden state and unrolling.
Explain backprop through time at a high level.
State why vanilla RNNs struggle on long sequences.

Before this lesson

Lesson 1 — CNNs

One step of vanilla RNN

At time t:

hₜ = activation(W_h hₜ₋₁ + W_x xₜ + b)

xₜ — input at step t (e.g. word embedding)
hₜ — summary of the sequence so far
Same W_h, W_x at every step — shared across time

For sentiment, h at the last word can feed a classifier (positive / negative).
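
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes tanh as the activation, a zero initial hidden state, and random weights and inputs standing in for trained parameters and word embeddings. The names rnn_step and rnn_forward are made up for this example.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def rnn_forward(xs, W_h, W_x, b):
    """Apply the same step (same weights) to every element of the sequence."""
    h = np.zeros(W_h.shape[0])  # assumed initial state h_0 = 0
    hs = []
    for x_t in xs:
        h = rnn_step(h, x_t, W_h, W_x, b)
        hs.append(h)
    return np.stack(hs)         # shape: (steps, hidden)

rng = np.random.default_rng(0)
hidden, embed, steps = 4, 3, 5  # illustrative sizes
W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_x = rng.normal(scale=0.5, size=(hidden, embed))
b = np.zeros(hidden)
xs = rng.normal(size=(steps, embed))  # stand-in for 5 word embeddings

hs = rnn_forward(xs, W_h, W_x, b)
h_last = hs[-1]                       # the state a sentiment classifier would read

# Feeding the same inputs in reverse order gives a different final state.
hs_reversed = rnn_forward(xs[::-1], W_h, W_x, b)
print(np.allclose(h_last, hs_reversed[-1]))
```

The last two lines connect back to the “great”/“not” example: because each step depends on the previous hidden state, reordering the inputs changes the summary.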

Backprop through time (BPTT)

Training unrolls the RNN over all time steps, computes the loss (for example, at the final step), and then backpropagates through every time step in reverse order.
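
As a rough sketch of that procedure, the code below unrolls a small RNN, then walks backward through the steps accumulating gradients for the shared weights. Several things here are assumptions made for illustration: the tanh activation, the zero initial state, and a toy loss (half the squared norm of the final hidden state) in place of a real classifier loss. A finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def forward(xs, W_h, W_x, b):
    """Unroll the RNN; return hidden states h_0..h_T (h_0 = 0 is assumed)."""
    hs = [np.zeros(W_h.shape[0])]
    for x_t in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_x @ x_t + b))
    return hs

def loss_fn(xs, W_h, W_x, b):
    """Toy loss on the final hidden state only (an illustrative choice)."""
    h_T = forward(xs, W_h, W_x, b)[-1]
    return 0.5 * np.sum(h_T ** 2)

def bptt(xs, W_h, W_x, b):
    """Gradients of the toy loss with respect to the shared parameters."""
    hs = forward(xs, W_h, W_x, b)
    dW_h, dW_x, db = np.zeros_like(W_h), np.zeros_like(W_x), np.zeros_like(b)
    dh = hs[-1].copy()                   # dL/dh_T for L = 0.5 * ||h_T||^2
    for t in range(len(xs), 0, -1):      # walk backward through time
        da = dh * (1.0 - hs[t] ** 2)     # back through tanh
        dW_h += np.outer(da, hs[t - 1])  # weights are shared, so contributions add
        dW_x += np.outer(da, xs[t - 1])
        db += da
        dh = W_h.T @ da                  # hand the gradient to the previous step
    return dW_h, dW_x, db

rng = np.random.default_rng(1)
W_h = rng.normal(scale=0.5, size=(4, 4))
W_x = rng.normal(scale=0.5, size=(4, 3))
b = np.zeros(4)
xs = rng.normal(size=(6, 3))

dW_h, dW_x, db = bptt(xs, W_h, W_x, b)

# Finite-difference check on a single entry of W_h.
eps = 1e-6
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (loss_fn(xs, W_plus, W_x, b) - loss_fn(xs, W_minus, W_x, b)) / (2 * eps)
print(np.isclose(numeric, dW_h[0, 1], atol=1e-6))
```

Note the line dh = W_h.T @ da: every step back in time multiplies the gradient by W_h once more. In practice a framework's automatic differentiation does this bookkeeping for you.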

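Gradients flow backward through repeated multiplications by W_h (and by the activation's derivative). If these factors consistently shrink the gradient, it decays roughly exponentially with the number of steps (vanishing); if they consistently grow it, it blows up (exploding).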
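
A small numerical experiment makes the exponential effect concrete. The sketch below is a deliberate simplification: it ignores the activation's derivative and uses a scaled orthogonal matrix for W_h, so each backward step changes the gradient norm by exactly the scale factor. Real weight matrices are messier, but the trend is the same.

```python
import numpy as np

def grad_norms(W_h, steps):
    """Norm of a gradient vector after each backward multiplication by W_h^T."""
    g = np.ones(W_h.shape[0]) / np.sqrt(W_h.shape[0])  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W_h.T @ g
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # orthogonal matrix: preserves norms

small = grad_norms(0.9 * Q, 50)  # each step scales the gradient norm by 0.9
large = grad_norms(1.1 * Q, 50)  # each step scales the gradient norm by 1.1

print(f"after 50 steps: {small[-1]:.4f} (vanishing) vs {large[-1]:.1f} (exploding)")
```

With these scales the norms after 50 steps are 0.9^50 (about 0.005) and 1.1^50 (about 117), even though both factors are close to 1. With tanh, the activation's derivative is at most 1, so it can only push further toward vanishing.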

Long sequence problem

Plain RNNs forget context from many steps ago:

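“The movie was not … good”: the negation sits far from “good”, so its effect is hard to carry forward to the end of the sentence.
In long documents, early sentences lose their influence on the final hidden state.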

That motivated LSTM and GRU (next lesson).
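
The forgetting described above can also be seen in the forward pass. The sketch below flips only the first input of a sequence and measures how much the final hidden state changes as the sequence gets longer. The setup is an assumption chosen to make the effect visible: W_h is rescaled so its largest singular value is 0.5, the bias is omitted, and inputs are random. Trained networks need not behave this cleanly.

```python
import numpy as np

def final_state(xs, W_h, W_x):
    """Hidden state after the last step (bias omitted for brevity)."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

rng = np.random.default_rng(0)
hidden, embed = 8, 4
A = rng.normal(size=(hidden, hidden))
W_h = 0.5 * A / np.linalg.norm(A, 2)  # assumed: largest singular value is 0.5
W_x = rng.normal(scale=0.5, size=(hidden, embed))

xs = rng.normal(size=(20, embed))
changed = xs.copy()
changed[0] = -changed[0]              # alter only the first "word"

effects = {}
for n in (1, 5, 20):
    diff = final_state(xs[:n], W_h, W_x) - final_state(changed[:n], W_h, W_x)
    effects[n] = np.linalg.norm(diff)
    print(f"sequence length {n:2d}: change in final state = {effects[n]:.2e}")
```

Under these assumptions each extra step at least halves the trace of the first input, so by length 20 the final state barely registers that the first input changed.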

RNN vs CNN (when to use which)

Data	Typical architecture
Images	CNN
Text / time series	RNN, LSTM, GRU (or transformers later)
Fixed tabular rows	MLP / gradient boosting

Checkpoint

Why do RNNs struggle with long sequences?

Answer sketch

Gradients propagated over many time steps vanish or explode, so a vanilla RNN cannot reliably learn to keep information from distant past steps in its hidden state.

What's next

Lesson 3 — LSTM & GRU