Module 6 — Transformers (core of GenAI)

Self-attention & multi-head attention

Tokens attending to tokens, contextual vectors, multiple heads, and causal masking for generation.

~75 min read + exercises

Before we begin

Self-attention means the sequence attends to itself — each word updates its meaning using context from other words in the same sentence.

Consider “bank” in “river bank” versus “money bank”: self-attention uses the neighboring words to tell the two senses apart.

Figure

All-to-all connections

Self-attention: every token can look at every token. Tokens shown: The, cat, sat. Darker = stronger attention weight.
Each token row blends information from every column (including itself).

What you will learn

  • Define self-attention vs cross-attention.
  • Explain multi-head attention in plain language.
  • Describe causal masking for text generation.


Self-attention output

For each token position i, self-attention produces a new vector that mixes:

  • the token's own representation
  • the representations of the words it should focus on (subjects, verbs, punctuation patterns, etc.)

After one layer, “sat” might carry cat-ness; after several layers, richer syntax and semantics.
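
The mixing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full layer: it assumes the standard scaled dot-product form (scores divided by the square root of the vector size, then softmax), and it uses the token vectors directly as queries, keys, and values, where a real model would apply learned projections. The function name `self_attention` and the two-dimensional embeddings are made up for the example.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """x: list of token vectors. Returns (weights, outputs).

    Simplification (assumption): queries, keys, and values are x itself.
    Real models first multiply x by learned Q/K/V matrices.
    """
    d = len(x[0])
    weights = []
    for q in x:
        # Score this token against every token, including itself.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all token vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs

tokens = ["The", "cat", "sat"]
x = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy embeddings, not learned
weights, outputs = self_attention(x)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row corresponds to one row of the figure's grid: how much that token draws from “The”, “cat”, and “sat” when building its new vector.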


Multi-head attention

One attention pattern might track subject–verb; another adjective–noun; another coreference (“it” → “cat”).

Multiple heads run parallel attentions with separate Q/K/V projections, then concatenate and project again.

Think: several specialists reading the same sentence, then merging notes.
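
The "specialists, then merged notes" idea can be sketched as follows. This is a toy version under stated assumptions: the projection matrices are random rather than learned, the sizes (`d_model = 4`, two heads) are arbitrary, and the helper names (`matmul`, `attention`, `multi_head_attention`) are made up for the example.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d = len(q[0])
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        out.append([sum(wi * vj[c] for wi, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def multi_head_attention(x, heads, w_out):
    """heads: one (w_q, w_k, w_v) triple of projection matrices per head."""
    # Each head attends with its own Q/K/V projections.
    per_head = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the heads' outputs token by token, then project again.
    concat = [[v for h in per_head for v in h[i]] for i in range(len(x))]
    return matmul(concat, w_out)

rng = random.Random(0)  # fixed seed so the toy run is repeatable

def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d_model, n_heads = 4, 2
d_head = d_model // n_heads
x = random_matrix(3, d_model)  # 3 tokens
heads = [tuple(random_matrix(d_model, d_head) for _ in range(3)) for _ in range(n_heads)]
w_out = random_matrix(d_model, d_model)
out = multi_head_attention(x, heads, w_out)
print(len(out), len(out[0]))  # prints: 3 4
```

Splitting `d_model` across the heads keeps the concatenated output the same width as the input, which is a common design choice rather than a requirement.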


Cross-attention (preview)

In translation decoders, queries come from the target side while keys/values come from the encoder output — the decoder looks up source sentence pieces when generating each target word. (Lesson 4 expands this.)
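
As a rough sketch of the difference from self-attention: the only change is where the queries and the keys/values come from. The example below assumes scaled dot-product scoring and skips the learned projections; `cross_attention`, `decoder_states`, and `encoder_outputs` are illustrative names with toy values.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the decoder; keys and values come from the encoder.

    Simplification (assumption): no learned Q/K/V projections.
    """
    d = len(encoder_outputs[0])
    weights, outputs = [], []
    for q in decoder_states:
        # One score per source position for this target position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_outputs]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, encoder_outputs)) for j in range(d)])
    return weights, outputs

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[2.0, 0.0], [0.0, 2.0]]               # 2 target positions
weights, outputs = cross_attention(decoder_states, encoder_outputs)
print(len(weights), len(weights[0]))  # prints: 2 3
```

The weight matrix is target length by source length (2 by 3 here), whereas in self-attention it is always square.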


Causal (masked) self-attention

When a model is trained to generate text left to right, position 5 must not see the words at positions 6, 7, and 8. Those are the words it is learning to predict, so seeing them would be cheating.

The causal mask sets the attention scores for future positions to −∞ before the softmax, so those positions receive zero weight.

GPT-style models apply this mask in every self-attention layer of the decoder stack.
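
A small sketch of the masking step, assuming the scores have already been computed. To make the effect easy to see, every raw score is set to 0, so each position spreads its weight evenly over the positions it is allowed to see. The function name `causal_attention_weights` is made up for the example.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may look at positions 0..i; later ones get -inf.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # finite, because position i always sees itself
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # toy scores: all equal
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The zeros above the diagonal are the masked future positions; the first token can only attend to itself.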


Why transformers beat RNNs here

RNN                        | Transformer self-attention
Sequential steps           | Parallel over length
Distant words = many hops  | Direct link in one layer
Hidden state bottleneck    | Each token gets custom mix

Checkpoint

Why did transformers replace RNNs for many NLP tasks?

Answer sketch

Training parallelizes across the sequence, and any two tokens are linked directly within a single layer. In practice this often gives better accuracy and faster training on GPUs.


What's next

Lesson 3 — Transformer architecture