Module 6 — Transformers (core of GenAI)

Attention mechanism — Query, Key, Value

Why RNNs bottleneck long-range links, the library analogy for Q/K/V, scaled dot-product steps, softmax weights, and cross- vs self-attention preview.

~85 min read + exercises

Before we begin

In Module 4 you saw RNNs and LSTMs read text one token at a time, carrying a single hidden state forward like a backpack. That works for short sentences — but real language needs selective focus.

When you read “The cat that chased the mouse sat on the mat”, understanding sat requires linking back to cat, not mat or mouse. The link is long-range and specific. RNNs must squeeze all of that through one vector updated at every step. Important details get diluted or forgotten.

Attention fixes this. Instead of only remembering the last hidden state, each position can look directly at every other position and ask: “Which words are relevant to me right now?”

That idea — introduced for machine translation in the 2010s and scaled up in the 2017 Transformer paper — is the core of GPT, BERT, Claude, and nearly every modern GenAI model.

What problem does attention solve? Long-range dependencies and selective routing — connecting distant words without forcing everything through a narrow sequential bottleneck.

Figure: Query, Key, Value

[Diagram: Q, K → scores → V mix]
Score how well a Query matches each Key, then blend the corresponding Value vectors: a differentiable soft lookup.

What you will learn

  • State clearly what problem attention solves (and what it does not solve).
  • Explain Query, Key, and Value with a concrete sentence example.
  • Walk through the four steps of attention: build Q, K, V → scores → weights → blend.
  • Understand why softmax turns scores into a probability-like focus pattern.
  • Contrast attention with the RNN hidden-state bottleneck.
  • Preview self-attention vs cross-attention (Lesson 2 goes deeper).


The bottleneck RNNs create

Picture an RNN translating English → French. At step 5 it must produce a French verb. To pick the right gender and number, it may need a noun from step 1, four steps earlier.

The RNN only has h₄ (one fixed-size vector) to carry that information. Early words compete for space in the same backpack:

| Problem | What happens |
| --- | --- |
| Compression | Many tokens must share one small vector |
| Distance | Information from step 1 passes through many multiplications to reach step 10 |
| No explicit links | The model never directly says “word 5 attends to word 1”; it must hope h stores that |

LSTMs and GRUs help with vanishing gradients, but they do not remove the bottleneck: still one hidden state per step, still sequential.

Attention adds an explicit routing mechanism: at each output step, the model can pull information from any input position it needs.
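
To make the bottleneck concrete, here is a minimal sketch of a vanilla tanh RNN loop in NumPy. The sizes, the random weights, and the function name `rnn_final_state` are illustrative assumptions, not code from Module 4. Whatever the sequence length, only one vector of size `hidden_size` is handed from step to step.

```python
import numpy as np

def rnn_final_state(inputs, W_xh, W_hh):
    """Run a vanilla tanh RNN over `inputs` and return the last hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:                      # strictly sequential: step t needs step t-1
        h = np.tanh(W_xh @ x_t + W_hh @ h)  # the whole past is squeezed into h
    return h

rng = np.random.default_rng(0)
embed_size, hidden_size = 8, 4              # hypothetical sizes
W_xh = rng.normal(size=(hidden_size, embed_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1

short_seq = rng.normal(size=(3, embed_size))    # 3 tokens
long_seq = rng.normal(size=(50, embed_size))    # 50 tokens

h_short = rnn_final_state(short_seq, W_xh, W_hh)
h_long = rnn_final_state(long_seq, W_xh, W_hh)
print(h_short.shape, h_long.shape)          # both are (hidden_size,)
```

A 3-token sentence and a 50-token sentence end up summarized in vectors of the same size, which is the compression problem listed in the table.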


What is attention? (plain language)

Attention is a way to compute a new vector for one position by taking a weighted average of vectors from other positions — where the weights come from how relevant each source is.

In one sentence:

Attention = “How much should I listen to each other word?” → blend their information accordingly.

It is soft (weights are continuous, not hard 0/1) and differentiable (gradients flow through weights, so training learns what to attend to).


The library analogy (Query, Key, Value)

Think of a library search:

| Role | Symbol | Library analogy | In a sentence |
| --- | --- | --- | --- |
| Query | Q | Your question: “I need info about the subject of this verb” | What the current word is looking for |
| Key | K | Labels on shelves: “grammar”, “subject noun”, “location” | How each other word advertises what it offers |
| Value | V | The actual books on the shelf | The content each word contributes if selected |

Process:

  1. Compare your Query to every shelf Key → compatibility scores.
  2. Turn scores into weights (high match = high weight).
  3. Read mostly the Value books from shelves with high weights.

Neural attention does the same with vectors and learned linear layers instead of humans and paper.

Important: Q, K, and V are not three copies of the word embedding. They are three different projections of the input — three “views” of the same token, learned during training so the model discovers useful roles.


Worked example: “The cat sat”

Take a tiny sentence (3 tokens). Each token already has an embedding vector from Module 4.

When we build a richer representation for sat, attention might learn patterns like:

| Token | Role for understanding sat | Intuition |
| --- | --- | --- |
| The | Low weight | Article: little semantic content for the verb |
| cat | High weight | Subject: who sat? |
| sat | Medium weight | Often attends to itself too |

So the new vector for sat ≈ mostly cat’s information + a bit of sat + almost none of The.

Different words ask different questions:

  • cat might attend strongly to sat (predicate link).
  • The might spread weight evenly (weak, structural).

The model learns these patterns from data — we do not hand-code “verbs attend to subjects.”
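
The blend for sat can be sketched numerically. The 2-D vectors and the weights below are made up for illustration (a trained model would learn its own); the point is only that the new vector is a weighted average dominated by cat.

```python
import numpy as np

tokens = ["The", "cat", "sat"]
# Hypothetical 2-D vectors standing in for each token's content.
values = np.array([
    [1.0, 0.0],   # The
    [0.0, 1.0],   # cat
    [1.0, 1.0],   # sat
])
# Illustrative attention weights for the position "sat"; they sum to 1.
weights = np.array([0.05, 0.70, 0.25])

new_sat = weights @ values   # weighted average of the three rows
print(new_sat)
```

With these numbers the result is approximately [0.30, 0.95]: the second coordinate, where cat's vector lives, dominates.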


The four steps (conceptual → formula)

For one Query position (e.g. the word sat):

Step 1 — Build Q, K, V

From input embeddings, multiply by learned matrices:

  • Q = “what am I looking for?”
  • K = “what do I offer?”
  • V = “what content do I pass if chosen?”

Every token gets its own Q, K, and V.
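
A minimal NumPy sketch of Step 1, assuming random matrices as stand-ins for the learned projections. The names `W_q`, `W_k`, `W_v` and the sizes are illustrative choices, not values from this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 8, 4          # hypothetical sizes

X = rng.normal(size=(seq_len, d_model))  # one embedding row per token
# Random stand-ins for the learned projection matrices.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token offers
V = X @ W_v   # what each token passes on if chosen
print(Q.shape, K.shape, V.shape)         # one row per token in each
```

Because the three matrices differ, the same embedding row produces three different vectors, one per role.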

Step 2 — Compatibility scores

Compare the Query at position i to every Key (all positions):

$$\text{score}_{i,j} = Q_i \cdot K_j$$

High dot product → vectors point in similar directions in learned space → compatible.
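
As a small sketch, all pairwise scores can be computed with one matrix product. The 2-D query and key vectors here are hypothetical numbers chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical 2-D queries and keys for "The", "cat", "sat" (one row each).
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 2.0]])
K = np.array([[0.5, 0.0],
              [1.0, 2.0],
              [1.0, 1.0]])

scores = Q @ K.T      # scores[i, j] is the dot product of Q[i] and K[j]
print(scores[2])      # scores for the query at "sat" against every key
```

For the last query the scores are 0.5, 5.0, and 3.0, so the second key is the best match in this toy setup.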

In translation (cross-attention), Q often comes from the target language word being generated; K and V come from the source sentence. The French decoder queries the English encoder: “Which English words matter for this French word?”

Step 3 — Softmax → attention weights

Raw scores can be any real numbers. Softmax converts them to weights that are:

  • Positive
  • Sum to 1 (like a probability distribution over positions)

$$\text{weight}_{i,j} = \text{softmax}_j(\text{score}_{i,j}) = \frac{\exp(\text{score}_{i,j})}{\sum_{k} \exp(\text{score}_{i,k})}$$

So position i “spends” 100% of its focus budget across all keys. If cat gets 0.7 and The gets 0.05, the model is confident about what matters.

Why softmax? It gives a smooth, trainable way to pick many sources with different strengths — not winner-take-all, unless the model learns very peaked scores.
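
A short sketch of softmax over one row of scores. The scores are hypothetical, and subtracting the row maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn scores into positive weights that sum to 1 along the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([0.5, 5.0, 3.0])   # hypothetical scores for one query
weights = softmax(scores)
print(weights, weights.sum())
```

The highest score receives most of the weight (about 0.87 here), but the other positions still keep a small, non-zero share.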

Scaled dot-product: In practice, scores are divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, so dot products do not grow huge and make softmax too sharp. You will see this written as “scaled dot-product attention”: same idea, more stable training.
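
The effect of scaling can be seen on a toy row of scores. The raw values and `d_k = 64` are assumptions picked for illustration.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

d_k = 64                                     # hypothetical key dimension
raw_scores = np.array([8.0, 4.0, 0.0])       # hypothetical unscaled dot products
scaled_scores = raw_scores / np.sqrt(d_k)    # becomes [1.0, 0.5, 0.0]

sharp = softmax(raw_scores)
smooth = softmax(scaled_scores)
print(sharp.max(), smooth.max())
```

Without scaling, the top position takes roughly 98% of the weight; after dividing by the square root of `d_k` it takes roughly 51%, so the other positions still contribute and still receive gradient.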

Step 4 — Weighted sum of Values

$$\text{output}_i = \sum_j \text{weight}_{i,j} \cdot V_j$$

The new vector for sat is a blend of all Value vectors, emphasizing cat’s V.

That output replaces (or is added to) the old representation — richer context-aware meaning.
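
Putting the four steps together, here is a minimal single-head sketch in NumPy. It assumes random stand-ins for the learned matrices and leaves out batching, masking, and multiple heads; the function name is an illustrative choice.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for queries Q, keys K, values V (2-D arrays)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # Step 2, with scaling
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # Step 3: softmax
    return weights @ V, weights                           # Step 4: blend Values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                               # 3 tokens, 8-D embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))  # Step 1 stand-ins

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape, weights.shape)
```

Each row of `weights` is one position's focus distribution, and each row of `output` is that position's new context-aware vector.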

Figure: End-to-end flow

[Diagram: Q, K → scores → V mix]
Same diagram as above; now you know what each box means.

Attention vs one big hidden state

| Aspect | RNN hidden state | Attention |
| --- | --- | --- |
| Information path | Sequential: must pass through every step | Direct: any position can link to any other |
| Capacity | One vector for everything so far | Each position gets a custom mix |
| Interpretability | Opaque compressed memory | Weights can be visualized (which words linked) |
| Parallelism | Step t waits for t−1 | All positions can be processed together (Lesson 2) |

Attention does not remove the need for depth: transformers stack many attention layers so meaning builds gradually (syntax → semantics → discourse). One layer might link subject and verb; a later layer might link “it” to “cat” across a paragraph.


Self-attention vs cross-attention (preview)

Both use Q, K, V. The difference is where Q and K/V come from:

| Type | Q from | K, V from | Typical use |
| --- | --- | --- | --- |
| Cross-attention | Decoder (target) | Encoder (source) | Translation: French word looks at English sentence |
| Self-attention | Same sequence | Same sequence | GPT/BERT: each word looks at other words in the same sentence |

Lesson 2 focuses on self-attention, multi-head attention, and causal masking — the heart of modern LLMs.
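
The difference shows up in the shapes. In this sketch the learned Q/K/V projections are skipped for brevity (a simplifying assumption), and the sequence lengths are hypothetical: 5 source positions and 3 target positions.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, weights)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 4
source = rng.normal(size=(5, d))   # e.g. 5 encoder (English) positions
target = rng.normal(size=(3, d))   # e.g. 3 decoder (French) positions

# Self-attention: Q, K, V all come from the same sequence.
self_out, self_w = attention(target, target, target)
# Cross-attention: Q from the target, K and V from the source.
cross_out, cross_w = attention(target, source, source)

print(self_w.shape, cross_w.shape)
```

The self-attention weights form a 3 × 3 grid (target over target), while the cross-attention weights form a 3 × 5 grid (each target position spreading its focus over the 5 source positions). Both outputs have one row per query.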


Where you meet attention in GenAI

| System | How attention shows up |
| --- | --- |
| GPT-style chat | Stacked causal self-attention: each token sees only past tokens when generating |
| BERT / embeddings | Bidirectional self-attention: each token sees left and right (reading, not generating) |
| Vision models | Patch attention: image patches attend to other patches (same QKV idea) |
| RAG + LLM | Retrieval is separate; inside the LLM, attention still mixes tokens in your prompt |

When people say transformers “understand context,” they largely mean: attention layers rewrote each token’s vector using the other tokens in its context.
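
As a preview of the GPT-style row in the table, here is one common way to sketch a causal mask: scores for future positions are set to negative infinity before softmax, so their weights become exactly zero. The all-zero scores are a hypothetical input chosen to make the pattern easy to read, and the function name is illustrative; Lesson 2 covers masking properly.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over each row, with positions to the right (the future) masked out."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)                    # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                   # hypothetical: every pair scores the same
weights = causal_attention_weights(scores)
print(weights)
```

Row 0 can only look at itself (weight 1), row 1 splits evenly over positions 0 and 1, and row 2 splits evenly over all three. A bidirectional model such as BERT would skip the mask and let every row see every column.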


Common misconceptions

“Attention is a memory database.”
No — it is a computation that blends vectors in one forward pass. Long-term memory in chat apps comes from context window + RAG + fine-tuning, not attention alone.

“Higher attention weight = human importance.”
Weights are model-internal and useful for debugging, but they are not guaranteed to match human judgment.

“One attention layer understands the whole document.”
Usually many layers are needed. Early layers often track local grammar; deeper layers handle harder references.

“Attention replaces embeddings.”
Embeddings still map token IDs to vectors. Attention mixes those vectors using Q, K, V.


Mini recap

  1. Problem: RNNs compress the whole past into one vector — bad for long, selective dependencies.
  2. Idea: Let each position query all others and blend their values.
  3. Q, K, V: Three learned roles — question, label, content.
  4. Scores → softmax → weighted sum of Values.
  5. Result: Context-aware vectors that power transformers and modern GenAI.

Checkpoint

Test yourself before Lesson 2.

1. What problem does attention solve?

Answer sketch

Routing information between positions — especially distant ones — without forcing everything through a single sequential hidden state. Each output can focus on the inputs that matter for that word.

2. In the library analogy, what is the difference between Key and Value?

Answer sketch

Key is the label used for matching (compatibility). Value is the content you actually retrieve and blend when the match is strong.

3. Why use softmax on attention scores?

Answer sketch

To get non-negative weights that sum to 1 — a soft focus distribution over all positions, differentiable for training.

4. When translating English → French, does the French word’s Query attend to English Keys or French Keys?

Answer sketch

English Keys (and Values) from the encoder — cross-attention. The decoder queries the source sentence.


What's next

You know what attention does and how Q, K, V fit together. Next: every token in a sentence attends to every other token in the same sentence (self-attention), plus multiple heads and the mask GPT uses when generating text.

Lesson 2 — Self-attention & multi-head attention