RNNs — sequences and hidden state
Before we begin
Reviews, stock prices, and audio frames are sequences: the order of the elements matters. Recurrent Neural Networks (RNNs) process a sequence one time step at a time and carry a hidden state forward from each step to the next.
“Great” then “not” means something different than “not” then “great”. RNNs model that order.
Figure: Unrolled RNN
What you will learn
- Define hidden state and unrolling.
- Explain backprop through time at a high level.
- State why vanilla RNNs struggle on long sequences.
One step of vanilla RNN
At time t:
hₜ = activation(W_h hₜ₋₁ + W_x xₜ + b)
- xₜ — input at step t (e.g. word embedding)
- hₜ — summary of the sequence so far
- Same W_h, W_x at every step — shared across time
For sentiment classification, the hidden state h at the last word can feed a classifier (positive / negative).
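The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training-ready implementation: it assumes tanh as the activation, a zero initial hidden state, and hand-picked toy weights. The two-dimensional "embeddings" `great` and `not_` are hypothetical values chosen only to show that word order changes the final state.

```python
import math

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h h_prev + W_x x_t + b)."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(W_x[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(pre))
    return h_t

def rnn_forward(xs, W_h, W_x, b):
    """Run the step over a whole sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    for x_t in xs:
        # The same W_h, W_x, b are reused at every time step.
        h = rnn_step(h, x_t, W_h, W_x, b)
    return h

# Toy, hand-picked weights: hidden size 2, input size 2 (illustrative only).
W_h = [[0.5, -0.3], [0.2, 0.4]]
W_x = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

# Hypothetical 2-d "embeddings" for two words.
great = [1.0, 0.0]
not_ = [0.0, 1.0]

h_a = rnn_forward([great, not_], W_h, W_x, b)
h_b = rnn_forward([not_, great], W_h, W_x, b)
print(h_a)
print(h_b)  # differs from h_a: the final state depends on word order
```

In practice the weights would be learned and the final hidden state passed to a classifier layer; both are omitted here to keep the focus on the recurrence itself.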
Backprop through time (BPTT)
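Training unrolls the RNN over all time steps, computes the loss (e.g. at the final step), then backpropagates through every time step.
Gradients flow backward through repeated multiplications by W_h (and by the activation's derivative), once per time step. If these factors are consistently smaller than 1 in magnitude, the gradient shrinks exponentially (vanishing); if they are consistently larger than 1, it grows exponentially (exploding).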
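To see the effect of repeated multiplication, consider a deliberately simplified case: a scalar hidden state and an identity activation, so that each backward step multiplies the gradient by exactly `w_h`. These simplifications are assumptions made for illustration; with tanh, every step also multiplies in a derivative factor of at most 1, which pushes further toward vanishing.

```python
def gradient_through_time(w_h, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w_h * h_{t-1} + x_t.

    Each backward step contributes one factor of w_h, so the result
    equals w_h ** steps.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w_h  # one chain-rule factor per time step
    return grad

# Compare a shrinking, a neutral, and a growing recurrent weight over 50 steps.
for w_h in (0.5, 1.0, 1.5):
    print(w_h, gradient_through_time(w_h, steps=50))
```

With `w_h = 0.5` the gradient after 50 steps is vanishingly small, with `w_h = 1.5` it is very large, and only `w_h = 1.0` leaves it unchanged.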
Long sequence problem
Plain RNNs forget context from many steps ago:
- “The movie was not … good”: when many words separate “not” from “good”, the network struggles to connect them.
- Long documents lose early sentences’ influence.
That motivated LSTM and GRU (next lesson).
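The forgetting described above can be made concrete with a toy experiment: flip only the first input of a sequence and measure how much the final hidden state changes as the sequence gets longer. The sketch assumes a scalar tanh RNN with fixed, hand-picked weights (`w_h = 0.5`, `w_x = 1.0`); these values are illustrative, not learned.

```python
import math

def final_state(xs, w_h=0.5, w_x=1.0):
    """Scalar tanh RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t), starting from h = 0."""
    h = 0.0
    for x_t in xs:
        h = math.tanh(w_h * h + w_x * x_t)
    return h

def first_input_effect(length):
    """Change in the final state when only the first input flips from 0 to 1."""
    base = [0.0] * length
    flipped = [1.0] + [0.0] * (length - 1)
    return abs(final_state(flipped) - final_state(base))

# The longer the sequence, the less the first input matters at the end.
for length in (2, 10, 40):
    print(length, first_input_effect(length))
```

Under these assumptions the influence of the first input shrinks by at least half at every later step, so after a few dozen steps it is effectively gone from the final state.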
RNN vs CNN (when to use which)
| Data | Typical architecture |
|---|---|
| Images | CNN |
| Text / time series | RNN, LSTM, GRU (or transformers later) |
| Fixed tabular rows | MLP / gradient boosting |
Checkpoint
Why do RNNs struggle with long sequences?
Answer sketch
Gradients propagated across many time steps vanish or explode, so the hidden state of a vanilla RNN cannot reliably store information from distant past steps.