Self-attention and multi-head attention

Before we begin

Self-attention means the sequence attends to itself — each word updates its meaning using context from other words in the same sentence.

For example, “bank” means different things in “river bank” and “money bank”; self-attention uses the neighboring words to disambiguate.

Figure

All-to-all connections

Each token row blends information from every column (including itself).

What you will learn

Define self-attention vs cross-attention.
Explain multi-head attention in plain language.
Describe causal masking for text generation.

Before this lesson

Lesson 1 — Attention mechanism

Self-attention output

For each token position i, self-attention produces a new vector that is a weighted mix of:

the token itself
the other tokens it should focus on (subjects, verbs, punctuation patterns, etc.)

After one layer, the vector for “sat” might carry some information about “cat”; after several layers, the vectors encode richer syntax and semantics.
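To make the mixing concrete, here is a minimal NumPy sketch of one self-attention step. It assumes the scaled dot-product form with learned Q/K/V projections; the function names, the random weight matrices, and the sizes `n`, `d_model`, and `d_k` are illustrative placeholders, not values from this lesson.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence X of shape (n, d_model)."""
    # Queries, keys, and values all come from the same sequence X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n): every token vs every token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # row i of the output is the new vector for token i

# Hypothetical toy input: 3 tokens (think "the", "cat", "sat") with random embeddings.
rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
```

Row i of `weights` says how much token i draws from each token in the sequence, including itself, and row i of `out` is the resulting blended vector.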

Multi-head attention

One attention pattern might track subject–verb; another adjective–noun; another coreference (“it” → “cat”).

Multi-head attention runs several attention operations in parallel, each with its own Q/K/V projections. The head outputs are then concatenated and passed through a final linear projection.

Think: several specialists reading the same sentence, then merging notes.
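The "specialists merging notes" picture can be sketched in NumPy as follows. This assumes the common layout in which one (d_model, d_model) matrix per Q/K/V is split column-wise into per-head projections; `multi_head_attention`, `split_heads`, `Wo`, and the toy sizes are hypothetical names and values chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention for one sequence X of shape (n, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head): each head gets its own slice.
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, n, n)
    weights = softmax(scores, axis=-1)                    # one attention pattern per head
    heads = weights @ V                                   # (num_heads, n, d_head)
    # Concatenate the heads back to (n, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
```

Each slice `weights[h]` is a separate (n, n) attention pattern, so different heads are free to focus on different relationships in the same sentence.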

Cross-attention (preview)

In translation decoders, queries come from the target side while keys/values come from the encoder output — the decoder looks up source sentence pieces when generating each target word. (Lesson 4 expands this.)
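As a preview, the only change from self-attention is where the inputs come from. The sketch below assumes scaled dot-product attention; `decoder_states`, `encoder_output`, and the toy sizes are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, Wq, Wk, Wv):
    """Queries from the target side; keys and values from the encoder output."""
    Q = decoder_states @ Wq    # (n_tgt, d_k)
    K = encoder_output @ Wk    # (n_src, d_k)
    V = encoder_output @ Wv    # (n_src, d_k)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (n_tgt, n_src)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_src, n_tgt, d_model, d_k = 6, 4, 8, 4
encoder_output = rng.normal(size=(n_src, d_model))   # stand-in for the encoded source sentence
decoder_states = rng.normal(size=(n_tgt, d_model))   # stand-in for the target-side states
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_output, Wq, Wk, Wv)
```

The weight matrix is no longer square: each of the `n_tgt` target positions gets one row of weights over the `n_src` source positions.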

Causal (masked) self-attention

When generating text left-to-right, position 5 must not attend to the tokens at positions 6, 7, 8 during training. Those are the tokens the model is learning to predict, so seeing them would be cheating.

The causal mask sets the attention scores for future positions to −∞ before the softmax, so those positions receive zero weight.

GPT-style models apply this mask in every self-attention layer of the decoder stack.
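A small NumPy sketch of the masking step, assuming the raw attention scores have already been computed as an (n, n) matrix. The function name `causal_attention_weights` and the all-zero toy scores are illustrative only.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to (n, n) attention scores, then softmax each row."""
    n = scores.shape[0]
    # True strictly above the diagonal: column j > row i means "future token".
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) is exactly 0, so future positions get zero weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# Toy scores: all zeros, so every *visible* position gets equal weight.
scores = np.zeros((4, 4))
weights = causal_attention_weights(scores)
```

With these uniform scores, row i spreads its weight evenly over positions 0 through i: the first token attends only to itself, the second splits 0.5/0.5 over the first two positions, and everything above the diagonal is exactly zero.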

Why transformers beat RNNs here

RNN	Transformer self-attention
Sequential steps	Parallel over length
Distant words = many hops	Direct link in one layer
Hidden state bottleneck	Each token gets custom mix

Checkpoint

Why did transformers replace RNNs for many NLP tasks?

Answer sketch

Training parallelizes across sequence positions, and any two tokens are linked directly within a single layer. In practice this often means better accuracy and faster training on GPUs.

What's next

Lesson 3 — Transformer architecture