LSTM and GRU — long-term memory

Before we begin

LSTM (Long Short-Term Memory) adds a memory cell and gates that decide what to keep, add, and output. GRU is a lighter variant with fewer gates.

What problem does LSTM solve? Remembering useful context across many time steps without vanishing gradients wiping it out.

Figure: LSTM gates. Forget, input, cell, and output control information flow.

What you will learn

Name the three LSTM gates and their roles.
Compare LSTM vs GRU at a practical level.
Know when to pick LSTM/GRU for sentiment and sequences.

Before this lesson

Lesson 2 — RNNs

LSTM intuition

Besides hidden state h, LSTM keeps cell state C — a conveyor belt of memory.

Gate	Role
Forget	Drop irrelevant old cell content
Input	Decide how much new candidate information to write into the cell
Output	Decide what part of the cell to expose as hidden state h

Each gate applies a sigmoid, which produces values between 0 and 1 that scale how much information passes through. That makes the gates differentiable “switches.”
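To make the table concrete, here is a minimal sketch of one LSTM time step in plain Python. It assumes the common textbook formulation with one weight matrix and one bias per gate. The helper names (`affine`, `init_params`, `lstm_step`) and the random initialization are illustrative choices, not part of any library. Here `c` stands for the cell state C and `h` for the hidden state.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Return W @ v + b for a list-of-lists matrix W."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params maps a block name to a (W, b) pair."""
    xh = x + h_prev  # concatenate input and previous hidden state
    f = [sigmoid(a) for a in affine(*params["forget"], xh)]       # how much old cell to keep
    i = [sigmoid(a) for a in affine(*params["input"], xh)]        # how much candidate to write
    o = [sigmoid(a) for a in affine(*params["output"], xh)]       # how much cell to expose
    g = [math.tanh(a) for a in affine(*params["candidate"], xh)]  # candidate values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical small random initialization, for illustration only."""
    rng = random.Random(seed)
    def block():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(input_size + hidden_size)]
             for _ in range(hidden_size)]
        return W, [0.0] * hidden_size
    return {name: block() for name in ("forget", "input", "output", "candidate")}

params = init_params(input_size=3, hidden_size=2)
h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)  # state is carried from step to step
print(h, c)
```

Note that the cell update is additive (`f * c_prev + i * g`), which is what lets information ride along the "conveyor belt" with little change when the forget gate stays near 1.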

GRU

GRU merges the cell state and hidden state into a single hidden state, controlled by two gates:

Reset gate — how much past to ignore when computing candidate
Update gate — blend old hidden vs new candidate
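The two gates above can be sketched as a single GRU step in plain Python. This is a minimal illustration, not a library implementation. It assumes the convention where an update gate near 1 favors the new candidate; some libraries flip that convention. The names `affine`, `gru_step`, and `zero_params` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Return W @ v + b for a list-of-lists matrix W."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(x, h_prev, params):
    """One GRU step. params maps a block name to a (W, b) pair."""
    xh = x + h_prev  # concatenate input and previous hidden state
    r = [sigmoid(a) for a in affine(*params["reset"], xh)]   # reset gate
    z = [sigmoid(a) for a in affine(*params["update"], xh)]  # update gate
    # The reset gate scales the past before the candidate is computed.
    x_rh = x + [rk * hk for rk, hk in zip(r, h_prev)]
    cand = [math.tanh(a) for a in affine(*params["candidate"], x_rh)]
    # The update gate blends the old hidden state with the new candidate.
    return [(1.0 - zk) * hk + zk * nk for zk, hk, nk in zip(z, h_prev, cand)]

# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new hidden state is half of the old one.
zero_params = {name: ([[0.0, 0.0]], [0.0]) for name in ("reset", "update", "candidate")}
h = gru_step([1.0], [0.8], zero_params)
print(h)
```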

GRU often reaches accuracy similar to LSTM with fewer parameters, since it has three weight blocks instead of four. That makes it a reasonable first model to try on medium-sized text tasks.
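The parameter difference can be estimated with simple arithmetic. The sketch below assumes the textbook layout with one weight matrix and one bias vector per block: four blocks for LSTM (three gates plus the candidate) and three for GRU (two gates plus the candidate). The function name and the sizes 100 and 128 are made up for illustration. Some library implementations store separate input and recurrent biases, so their counts come out slightly higher.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Parameters for one layer: each block has a weight matrix and a bias."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

lstm = gated_rnn_params(input_size=100, hidden_size=128, num_blocks=4)
gru = gated_rnn_params(input_size=100, hidden_size=128, num_blocks=3)
print(lstm, gru, gru / lstm)  # 117248 87936 0.75
```

Under these assumptions a GRU layer always has three quarters of the parameters of an LSTM layer of the same size.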

LSTM for sentiment

Review: “Not perfect, but honestly pretty good overall.”

Because an LSTM reads the words in order, it can link “not perfect” with the later “pretty good” better than a bag-of-words model can. Order and contrast matter.
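A quick way to see why bag-of-words struggles here: it discards word order, so two reviews with the same words but different emphasis look identical. The sketch below uses a simple regex tokenizer as a stand-in for a real one. The reordered review is an invented example.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

review = "Not perfect, but honestly pretty good overall."
reordered = "Pretty good, but honestly not perfect overall."  # invented contrast case

same_bag = Counter(tokens(review)) == Counter(tokens(reordered))
same_sequence = tokens(review) == tokens(reordered)
print(same_bag, same_sequence)  # True False
```

A sequence model such as an LSTM receives the token list, so it can tell the two reviews apart; a bag-of-words model receives only the counts.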

Your Module 4 project uses LSTM on product reviews.

vs Transformers (preview)

Transformers (Module 6) attend to all tokens at once and often outperform LSTMs on long text today. LSTM and GRU remain valuable for understanding sequential modeling and for small edge deployments.

Checkpoint

What does the forget gate do?

Answer sketch

It decides how much of the old cell state to erase before writing new information.
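To get a feel for the numbers, here is a small sketch under a strong simplifying assumption: the forget gate holds one constant value at every step and nothing new is written. The fraction of the original cell content that survives is then the gate value raised to the number of steps. The function name is hypothetical.

```python
def retained_fraction(forget_value, steps):
    """Fraction of the original cell content left after `steps` steps,
    assuming a constant forget gate and no new writes."""
    return forget_value ** steps

for f in (0.5, 0.9, 0.99):
    print(f, retained_fraction(f, 50))
```

With a gate value of 0.99, roughly 60% of the memory survives 50 steps; with 0.5, almost nothing does. Learning to keep the forget gate near 1 for useful content is how the cell state carries context across many time steps.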

What's next

Lesson 4 — Word embeddings