LSTM and GRU — long-term memory
Before we begin
LSTM (Long Short-Term Memory) adds a memory cell and three gates that decide what to forget, what to add, and what to output. GRU (Gated Recurrent Unit) is a lighter variant with two gates and no separate memory cell.
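What problem does LSTM solve? A plain RNN struggles to use context from many time steps back, because gradients shrink as they flow backward through time (the vanishing gradient problem). LSTM is designed to carry useful context across many time steps without that signal being wiped out.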
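To make the vanishing effect concrete, here is a toy calculation. It assumes the backward signal is multiplied by a single constant factor at every time step, which is a simplification, and the factors 0.9 and 0.99 are made-up values chosen for illustration. In an LSTM, the direct path through the cell state is scaled by the forget gate at each step, so a forget gate held close to 1 loses much less.

```python
# Toy illustration: a signal multiplied by one constant factor per time step.
# The factors below are illustrative assumptions, not measurements.

def surviving_fraction(per_step_factor: float, steps: int) -> float:
    """Fraction of a backward signal left after `steps` multiplications."""
    return per_step_factor ** steps

steps = 50
plain_rnn = surviving_fraction(0.9, steps)    # factor well below 1 at each step
gated_cell = surviving_fraction(0.99, steps)  # forget gate held close to 1

print(f"plain RNN-like path: {plain_rnn:.4f}")
print(f"gated cell path:     {gated_cell:.4f}")
```

After 50 steps the first path keeps well under 1% of the signal, while the second keeps more than half.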
Figure: LSTM gates
What you will learn
- Name the three LSTM gates and their roles.
- Compare LSTM vs GRU at a practical level.
- Know when to pick LSTM/GRU for sentiment and sequences.
LSTM intuition
Besides hidden state h, LSTM keeps cell state C — a conveyor belt of memory.
| Gate | Role |
|---|---|
| Forget | Drop irrelevant old cell content |
| Input | Add new candidate information |
| Output | What to expose as hidden state h |
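Each gate is a sigmoid layer whose outputs lie between 0 and 1. Multiplying a vector elementwise by a gate scales how much of it flows through: 0 blocks it, 1 passes it unchanged. Because sigmoid is smooth, these “switches” are differentiable and can be trained with backpropagation.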
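The gate roles in the table can be sketched as a single LSTM time step in plain Python. This is a minimal teaching sketch, not a production implementation: the helper names (`affine`, `lstm_step`, `init_params`), the parameter layout, and the random initial weights are all assumptions made for this example, and it follows the common textbook formulation in which every gate reads the previous hidden state concatenated with the current input.

```python
import math
import random

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, b, v):
    """Return W @ v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step. `p` is a dict of hypothetical weights and biases."""
    z = h_prev + x  # concatenate previous hidden state and current input
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]    # input gate
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(p["W_g"], p["b_g"], z)]  # candidate content
    # Cell state: keep a fraction of the old memory, add a fraction of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Hidden state: expose a gated, squashed view of the cell state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def init_params(input_size: int, hidden_size: int, seed: int = 0):
    """Small random weights and zero biases, for illustration only."""
    rng = random.Random(seed)
    cols = hidden_size + input_size
    p = {}
    for gate in "fiog":
        p[f"W_{gate}"] = [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                          for _ in range(hidden_size)]
        p[f"b_{gate}"] = [0.0] * hidden_size
    return p

params = init_params(input_size=3, hidden_size=2)
h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
print("hidden state:", h)
print("cell state:  ", c)
```

Note that the cell state is updated only by elementwise scaling and addition, which is what makes the "conveyor belt" picture apt.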
GRU
GRU merges the cell state and hidden state into a single hidden state, controlled by two gates:
- Reset gate — how much past to ignore when computing candidate
- Update gate — blend old hidden vs new candidate
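The two gates above can be sketched as one GRU time step in plain Python. As with any such sketch, the helper names and parameter layout are assumptions made for this example. Conventions also differ between references on whether the update gate weights the old state or the candidate; here a value near 1 favours the new candidate.

```python
import math
import random

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, b, v):
    """Return W @ v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(x, h_prev, p):
    """One GRU time step. `p` is a dict of hypothetical weights and biases."""
    z = h_prev + x  # concatenate previous hidden state and current input
    r = [sigmoid(a) for a in affine(p["W_r"], p["b_r"], z)]  # reset gate
    u = [sigmoid(a) for a in affine(p["W_u"], p["b_u"], z)]  # update gate
    # Reset gate scales the past before it feeds the candidate.
    gated_past = [r_k * h_k for r_k, h_k in zip(r, h_prev)]
    cand = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], gated_past + x)]
    # Update gate blends old hidden state with the new candidate.
    return [(1.0 - u_k) * h_k + u_k * c_k for u_k, h_k, c_k in zip(u, h_prev, cand)]

def init_params(input_size: int, hidden_size: int, seed: int = 0):
    """Small random weights and zero biases, for illustration only."""
    rng = random.Random(seed)
    cols = hidden_size + input_size
    p = {}
    for name in ("r", "u", "c"):
        p[f"W_{name}"] = [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                          for _ in range(hidden_size)]
        p[f"b_{name}"] = [0.0] * hidden_size
    return p

params = init_params(input_size=3, hidden_size=2)
h = [0.0, 0.0]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h = gru_step(x, h, params)
print("hidden state:", h)
```

There is no separate cell state here: the hidden state itself is the memory.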
GRU often reaches accuracy similar to LSTM with fewer parameters, which makes it a good default to try first on medium-sized text tasks.
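A rough parameter count shows where the saving comes from. The sketch below assumes the textbook formulation with one weight matrix and one bias vector per block, where LSTM has four blocks (forget, input, output, candidate) and GRU has three (reset, update, candidate). The sizes 100 and 128 are arbitrary example values, and real libraries may count slightly differently, for instance by keeping two bias vectors per block.

```python
def gated_rnn_params(input_size: int, hidden_size: int, num_blocks: int) -> int:
    """Parameters for one recurrent layer with `num_blocks` gate/candidate blocks.

    Each block has a weight matrix of shape (hidden, hidden + input)
    plus one bias vector of length hidden (an assumed convention).
    """
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 100, 128  # example sizes, chosen arbitrarily
lstm_params = gated_rnn_params(input_size, hidden_size, num_blocks=4)
gru_params = gated_rnn_params(input_size, hidden_size, num_blocks=3)

print("LSTM:", lstm_params)  # 117248
print("GRU: ", gru_params)   # 87936
print("GRU / LSTM:", gru_params / lstm_params)  # 0.75
```

Under this convention a GRU layer always has three quarters of the parameters of an LSTM layer of the same size.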
LSTM for sentiment
Review: “Not perfect, but honestly pretty good overall.”
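An LSTM reads the review in order, so it can relate the early “not perfect” to the later “pretty good”. A bag-of-words model only sees which words occur, not where, so it misses the contrast set up by “but”. Order and contrast matter.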
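A small check makes the bag-of-words limitation visible. The second review below is a hypothetical reordering invented for this example, and the `tokenize` helper is a deliberately crude assumption (lowercase, strip commas and periods, split on spaces). Both reviews produce the same word counts even though their word order differs, so a model that sees only counts cannot tell them apart.

```python
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Crude tokenizer for illustration: lowercase, drop commas and periods."""
    return text.lower().replace(",", "").replace(".", "").split()

review_a = "Not perfect, but honestly pretty good overall."
review_b = "Not good, but honestly pretty perfect overall."  # hypothetical reordering

tokens_a, tokens_b = tokenize(review_a), tokenize(review_b)

same_bag = Counter(tokens_a) == Counter(tokens_b)  # order discarded
same_sequence = tokens_a == tokens_b               # order preserved

print("same bag of words:", same_bag)       # True
print("same token sequence:", same_sequence)  # False
```

A sequence model such as an LSTM receives the token list, not the counts, so the two reviews are different inputs to it.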
Your Module 4 project uses LSTM on product reviews.
vs Transformers (preview)
Transformers (Module 6) attend to all tokens at once and today often outperform LSTM on long text. LSTM and GRU remain valuable for understanding sequential modeling and for small edge deployments.
Checkpoint
What does the forget gate do?
Answer sketch
It decides how much of the old cell state to erase before writing new information.