Self-attention and multi-head attention
Before we begin
Self-attention means the sequence attends to itself — each word updates its meaning using context from other words in the same sentence.
For example, “bank” means different things in “river bank” and “money bank”; self-attention uses the neighboring words to tell the two apart.
Figure: All-to-all connections
What you will learn
- Define self-attention vs cross-attention.
- Explain multi-head attention in plain language.
- Describe causal masking for text generation.
Self-attention output
For each token position i, self-attention produces a new vector that is a weighted mix of:
- itself
- words it should focus on (subjects, verbs, punctuation patterns, etc.)
After one layer, the vector for “sat” might carry some information about “cat”; after several layers, the vectors encode richer syntax and semantics.
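The mixing described above can be sketched as single-head scaled dot-product self-attention. This is a minimal NumPy illustration, not a reference implementation: the function names, the random projection matrices `W_q`, `W_k`, `W_v`, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X: (seq_len, d_model) token vectors.
    W_q, W_k, W_v: (d_model, d_k) projection matrices (hypothetical, random here).
    Returns (output, weights); weights[i, j] is how much token i attends to token j.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) compatibility scores
    weights = softmax(scores, axis=-1)        # each row is a distribution over tokens
    return weights @ V, weights               # new vector i = weighted mix of value vectors

# Toy sizes and random data, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)              # (4, 8)
print(weights.sum(axis=-1))   # each row sums to 1, up to floating-point error
```

Row i of `weights` is the mix for token i: the diagonal entry is how much it keeps of itself, and the other entries are how much it takes from the words it focuses on.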
Multi-head attention
One attention pattern might track subject–verb; another adjective–noun; another coreference (“it” → “cat”).
Each head runs its own attention in parallel, with separate Q/K/V projections; the head outputs are then concatenated and passed through one more linear projection.
Think: several specialists reading the same sentence, then merging notes.
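As a rough sketch of the "specialists, then merge" idea, the NumPy code below splits the projected queries, keys, and values into heads, runs attention per head, concatenates, and applies an output projection. The names (`multi_head_self_attention`, `W_o`, `num_heads`) and the random weights are assumptions for illustration; real libraries organize this differently.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention (illustrative sketch).

    X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    d_model is assumed to be divisible by num_heads.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                    # one attention pattern per head
    heads = weights @ V                                   # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # merge the heads' notes
    return concat @ W_o, weights                          # final projection

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

Each slice `weights[h]` is one head's attention pattern over the same sentence, so different heads are free to focus on different relationships.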
Cross-attention (preview)
In translation decoders, queries come from the target side while keys/values come from the encoder output — the decoder looks up source sentence pieces when generating each target word. (Lesson 4 expands this.)
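A small sketch may help make the difference from self-attention concrete: the only change is where Q, K, and V come from. The function name `cross_attention`, the variable names, and the random inputs below are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    """Single-head cross-attention (illustrative sketch).

    decoder_states: (tgt_len, d_model), the source of the queries.
    encoder_output: (src_len, d_model), the source of the keys and values.
    """
    Q = decoder_states @ W_q                  # queries from the target side
    K = encoder_output @ W_k                  # keys from the source side
    V = encoder_output @ W_v                  # values from the source side
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)        # each target position looks over the source
    return weights @ V, weights

# Toy sizes and random data, chosen only for illustration.
rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 6, 3, 8
encoder_output = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(out.shape)      # (3, 8)
print(weights.shape)  # (3, 6)
```

The weight matrix is no longer square: it has one row per target position and one column per source position.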
Causal (masked) self-attention
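When generating text left-to-right, the model predicts each token from the tokens before it. During training, position 5 must therefore not see words 6, 7, 8: the tokens it is being trained to predict would leak into its input.
The causal mask sets the attention scores for future positions to −∞ before the softmax, so those positions receive zero weight.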
GPT-style models use this everywhere in the decoder stack.
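The masking step can be sketched in a few lines of NumPy. The helper names `causal_mask` and `causal_attention_weights` are hypothetical, and the all-zero scores are a toy input chosen so the effect of the mask is easy to read.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) matrix: True where query i may see key j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def causal_attention_weights(scores):
    """Mask future positions in raw scores, then softmax over each row (sketch)."""
    mask = causal_mask(scores.shape[-1])
    masked = np.where(mask, scores, -np.inf)               # future positions -> -inf
    masked = masked - masked.max(axis=-1, keepdims=True)   # stable softmax; row max is finite
    e = np.exp(masked)                                     # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

# Toy input: every pair of positions is equally compatible.
scores = np.zeros((4, 4))
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
# Row i spreads its weight evenly over positions 0..i and puts 0 on later positions.
```

Because the diagonal is always allowed, every row has at least one finite score, so the softmax never divides by zero.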
Why transformers beat RNNs here
| RNN | Transformer self-attention |
|---|---|
| Sequential steps | Parallel over length |
| Distant words = many hops | Direct link in one layer |
| Hidden state bottleneck | Each token gets custom mix |
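The first two rows of the table can be illustrated with a toy comparison. This is a simplified sketch under loud assumptions: the RNN is a bare tanh recurrence with random weights, and the attention scores use the inputs directly as queries and keys, with no learned projections.

```python
import numpy as np

# Toy sizes and random data, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d = 6, 4
X = rng.normal(size=(seq_len, d))

# RNN-style: step t needs the hidden state from step t-1,
# so the loop over positions cannot run in parallel.
W_h = rng.normal(size=(d, d)) * 0.1   # hypothetical recurrent weights
W_x = rng.normal(size=(d, d)) * 0.1   # hypothetical input weights
h = np.zeros(d)
rnn_states = []
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)  # information from token 0 reaches token 5 after 5 updates
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Attention-style: all pairwise scores come from one matrix product,
# so every position is processed at once.
scores = X @ X.T / np.sqrt(d)         # (seq_len, seq_len)
first_to_last = scores[0, 5]          # a direct link between the first and last tokens

print(rnn_states.shape)  # (6, 4)
print(scores.shape)      # (6, 6)
```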
Checkpoint
Why did transformers replace RNNs for many NLP tasks?
Answer sketch
Training parallelizes across sequence positions, and any two tokens are connected directly within a single layer. In practice this often gives better accuracy and faster training on GPUs.