Attention mechanism — Query, Key, Value
Before we begin
In Module 4 you saw RNNs and LSTMs read text one token at a time, carrying a single hidden state forward like a backpack. That works for short sentences — but real language needs selective focus.
When you read “The cat that chased the mouse sat on the mat”, understanding sat requires linking back to cat, not mat or mouse. The link is long-range and specific. RNNs must squeeze all of that through one vector updated at every step. Important details get diluted or forgotten.
Attention fixes this. Instead of only remembering the last hidden state, each position can look directly at every other position and ask: “Which words are relevant to me right now?”
That idea — introduced for machine translation in the 2010s and scaled up in the 2017 Transformer paper — is the core of GPT, BERT, Claude, and nearly every modern GenAI model.
What problem does attention solve? Long-range dependencies and selective routing — connecting distant words without forcing everything through a narrow sequential bottleneck.
Figure: Query, Key, Value
What you will learn
- State clearly what problem attention solves (and what it does not solve).
- Explain Query, Key, and Value with a concrete sentence example.
- Walk through the four steps of attention: build Q, K, V → scores → weights → blend.
- Understand why softmax turns scores into a probability-like focus pattern.
- Contrast attention with the RNN hidden-state bottleneck.
- Preview self-attention vs cross-attention (Lesson 2 goes deeper).
Before this lesson
- Module 6 welcome
- Module 4 — RNNs — hidden state and long-sequence limits
- Module 4 — Word embeddings — each word is already a vector
The bottleneck RNNs create
Picture an RNN translating English → French. At step 5 it must produce a French verb. To pick the right gender and number, it may need a noun from step 1, four steps earlier.
The RNN only has h₄ (one fixed-size vector) to carry that information. Early words compete for space in the same backpack:
| Problem | What happens |
|---|---|
| Compression | Many tokens must share one small vector |
| Distance | Information from step 1 passes through many multiplications to reach step 10 |
| No explicit links | The model never directly says “word 5 attends to word 1” — it must hope h stores that |
LSTMs and GRUs help with vanishing gradients, but they do not remove the bottleneck: still one hidden state per step, still sequential.
Attention adds an explicit routing mechanism: at each output step, the model can pull information from any input position it needs.
What is attention? (plain language)
Attention is a way to compute a new vector for one position by taking a weighted average of vectors from all positions (its own included), where the weights come from how relevant each source is.
In one sentence:
Attention = “How much should I listen to each other word?” → blend their information accordingly.
It is soft (weights are continuous, not hard 0/1) and differentiable (gradients flow through weights, so training learns what to attend to).
The library analogy (Query, Key, Value)
Think of a library search:
| Role | Symbol | Library analogy | In a sentence |
|---|---|---|---|
| Query | Q | Your question: “I need info about the subject of this verb” | What the current word is looking for |
| Key | K | Labels on shelves: “grammar”, “subject noun”, “location” | How each other word advertises what it offers |
| Value | V | The actual books on the shelf | The content each word contributes if selected |
Process:
- Compare your Query to every shelf Key → compatibility scores.
- Turn scores into weights (high match = high weight).
- Read mostly the Value books from shelves with high weights.
Neural attention does the same with vectors and learned linear layers instead of humans and paper.
Important: Q, K, and V are not three copies of the word embedding. They are three different projections of the input — three “views” of the same token, learned during training so the model discovers useful roles.
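To make the library analogy concrete, here is a minimal sketch of a "soft" shelf lookup in plain Python. The function name `soft_lookup` and the 2-D toy vectors are made up for illustration, and the sketch deliberately skips the learned linear layers: the query, keys, and values are written by hand.

```python
import math

def soft_lookup(query, keys, values):
    """Blend all values, weighted by how well each key matches the query."""
    # Compatibility score: dot product between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax: turn scores into positive weights that sum to 1.
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors, one dimension at a time.
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, blended

# Hypothetical shelves: each has a key (its label) and a value (its content).
shelf_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
shelf_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]   # points mostly toward the first shelf's key

weights, blended = soft_lookup(query, shelf_keys, shelf_values)
print([round(w, 2) for w in weights])   # roughly [0.76, 0.01, 0.23]
```

Unlike a dictionary lookup, which returns exactly one entry, every shelf contributes something here; the best-matching shelf simply contributes the most.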
Worked example: “The cat sat”
Take a tiny sentence (3 tokens). Each token already has an embedding vector from Module 4.
When we build a richer representation for sat, attention might learn patterns like:
| Token | Role for understanding sat | Intuition |
|---|---|---|
| The | Low weight | Article — little semantic content for the verb |
| cat | High weight | Subject — who sat? |
| sat | Medium weight | Often attends to itself too |
So the new vector for sat ≈ mostly cat’s information + a bit of sat + almost none of The.
Different words ask different questions:
- cat might attend strongly to sat (predicate link).
- The might spread weight evenly (weak, structural).
The model learns these patterns from data — we do not hand-code “verbs attend to subjects.”
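The blend for sat can be worked through by hand. The sketch below assumes illustrative weights (0.05 for The, 0.70 for cat, 0.25 for sat) and made-up 2-D value vectors; a trained model would produce both.

```python
# Hypothetical 2-D value vectors for each token (real models use far more dimensions).
values = {
    "The": [0.0, 1.0],
    "cat": [1.0, 0.0],
    "sat": [0.5, 0.5],
}

# Illustrative attention weights for the query token "sat"; they sum to 1.
weights = {"The": 0.05, "cat": 0.70, "sat": 0.25}

# New vector for "sat": weighted sum of the value vectors, dimension by dimension.
new_sat = [
    sum(weights[tok] * values[tok][d] for tok in values)
    for d in range(2)
]
print([round(x, 3) for x in new_sat])   # [0.825, 0.175]
```

The result sits close to cat's vector [1.0, 0.0], which is what "mostly cat's information" means in numbers.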
The four steps (conceptual → formula)
For one Query position (e.g. the word sat):
Step 1 — Build Q, K, V
From input embeddings, multiply by learned matrices:
- Q = “what am I looking for?”
- K = “what do I offer?”
- V = “what content do I pass if chosen?”
Every token gets its own Q, K, and V.
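Step 1 might look like this in NumPy. The sizes (`d_model = 4`, `d_k = 3`), the matrix names `W_q`, `W_k`, `W_v`, and the random numbers are all assumptions for the sketch; in a trained model the embeddings and the three matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 4, 3                  # hypothetical embedding size and projection size
tokens = ["The", "cat", "sat"]

# Stand-in embeddings, one row per token. A trained model would look these up.
X = rng.normal(size=(len(tokens), d_model))

# Three separate projection matrices (random placeholders, not trained).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # row i: what token i is looking for
K = X @ W_k   # row i: what token i offers
V = X @ W_v   # row i: the content token i passes on if chosen

print(Q.shape, K.shape, V.shape)     # (3, 3) (3, 3) (3, 3)
```

Because the three matrices differ, the same embedding row yields three different vectors, one per role.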
Step 2 — Compatibility scores
Compare the Query at position i to every Key (all positions j) using a dot product:
score(i, j) = q_i · k_j
High dot product → vectors point in similar directions in learned space → compatible.
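A small sketch of the score computation, assuming hand-picked 2-D queries and keys for The, cat, and sat (the numbers are invented so the arithmetic is easy to follow):

```python
import numpy as np

# Hypothetical queries and keys, one row per token.
Q = np.array([[0.1, 0.1],    # The
              [1.0, 0.5],    # cat
              [2.0, 0.0]])   # sat
K = np.array([[0.0, 1.0],    # The
              [1.5, 0.0],    # cat
              [0.5, 0.5]])   # sat

# scores[i, j] is the dot product of query i with key j.
scores = Q @ K.T

sat_scores = scores[2]       # how the query for "sat" rates each key
print(sat_scores)            # [0. 3. 1.]
```

One matrix product gives every query-key pair at once; the row for sat shows cat's key as the best match.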
In translation (cross-attention), Q often comes from the target language word being generated; K and V come from the source sentence. The French decoder queries the English encoder: “Which English words matter for this French word?”
Step 3 — Softmax → attention weights
Raw scores can be any real numbers. Softmax converts them to weights that are:
- Positive
- Sum to 1 (like a probability distribution over positions)
So position i “spends” 100% of its focus budget across all keys. If cat gets 0.7 and The gets 0.05, the model is confident about what matters.
Why softmax? It gives a smooth, trainable way to pick many sources with different strengths — not winner-take-all, unless the model learns very peaked scores.
Scaled dot-product: In practice, scores are divided by √d_k, the square root of the key dimension, so dot products do not grow huge and make softmax too sharp. You will see this written as “scaled dot-product attention”: same idea, more stable training.
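Here is a small sketch of softmax and of the scaling step in plain Python. The raw scores and the key dimension `d_k = 64` are made-up numbers; dividing by the square root of `d_k` follows the standard scaled dot-product formulation.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                                      # hypothetical key dimension
raw_scores = [2.0, 24.0, 8.0]                 # made-up dot products for The, cat, sat

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 2) for w in unscaled])        # [0.0, 1.0, 0.0]
print([round(w, 2) for w in scaled])          # [0.05, 0.83, 0.11]
```

Without scaling, nearly all the weight lands on one token and the others get almost no gradient signal; with scaling, the ranking stays the same but the focus is softer.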
Step 4 — Weighted sum of Values
The new vector for sat is a blend of all Value vectors, emphasizing cat’s V.
That output replaces (or is added to) the old representation — richer context-aware meaning.
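For readers who prefer symbols, the four steps can be written compactly as below. The notation is an assumption of this sketch, not something fixed by the lesson: x_i is the embedding at position i, W^Q, W^K, W^V are the learned matrices, d_k is the key dimension, α_ij is the attention weight, and z_i is the new vector for position i.

```latex
\begin{aligned}
\text{Step 1:}\quad & q_i = x_i W^Q, \qquad k_j = x_j W^K, \qquad v_j = x_j W^V \\
\text{Step 2:}\quad & s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} \\
\text{Step 3:}\quad & \alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{m} \exp(s_{im})} \\
\text{Step 4:}\quad & z_i = \sum_{j} \alpha_{ij}\, v_j
\end{aligned}
```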
Figure: End-to-end flow
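The whole flow fits in a few lines of NumPy. This is a minimal single-head sketch with random placeholder weights; the function names and sizes are assumptions, and real implementations add batching, multiple heads, and masking.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for one attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # Step 2: compatibility scores, scaled
    weights = softmax(scores, axis=-1)    # Step 3: each row sums to 1
    return weights @ V, weights           # Step 4: weighted sum of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))               # stand-in embeddings for "The cat sat"
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))   # untrained placeholders

# Step 1: project the embeddings into Q, K, V, then run the remaining steps.
output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)

print(output.shape)                       # (3, 4): one new vector per token
print(weights.sum(axis=1))                # each row sums to 1 (up to rounding)
```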
Attention vs one big hidden state
| | RNN hidden state | Attention |
|---|---|---|
| Information path | Sequential — must pass through every step | Direct — any position can link to any other |
| Capacity | One vector for everything so far | Each position gets a custom mix |
| Interpretability | Opaque compressed memory | Weights can be visualized (which words linked) |
| Parallelism | Step t waits for t−1 | All positions can be processed together (Lesson 2) |
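The contrast in the table can be seen in code. The sketch below uses random, untrained weights and, for brevity, skips the Q/K/V projections on the attention side by using the embeddings directly; both simplifications are assumptions made only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 4
X = rng.normal(size=(seq_len, d))          # stand-in token embeddings

# RNN-style: step t needs the hidden state from step t-1, so the loop is sequential.
W_h = rng.normal(size=(d, d)) * 0.1        # placeholder weights, not trained
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)       # everything seen so far is squeezed into h

# Attention-style: all pairwise scores come from one matrix product, no loop over time.
scores = X @ X.T / np.sqrt(d)
scores -= scores.max(axis=1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)
mixed = weights @ X                        # one custom mix per position

print(h.shape)       # (4,): a single vector summarizes the sequence
print(mixed.shape)   # (5, 4): every position keeps its own context-aware vector
```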
Attention does not remove the need for depth: transformers stack many attention layers so meaning builds gradually (syntax → semantics → discourse). One layer might link subject–verb; a later layer might link “it” → “cat” across a paragraph.
Self-attention vs cross-attention (preview)
Both use Q, K, V. The difference is where Q and K/V come from:
| Type | Q from | K, V from | Typical use |
|---|---|---|---|
| Cross-attention | Decoder (target) | Encoder (source) | Translation: French word looks at English sentence |
| Self-attention | Same sequence | Same sequence | GPT/BERT: each word looks at other words in the same sentence |
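In code, the two types can share one function and differ only in what is passed in. The sketch below is an illustration under assumptions: the function name `attention`, the random embeddings, and the reuse of one set of untrained weight matrices for both calls are all simplifications (a real model keeps separate learned weights for each attention layer).

```python
import numpy as np

def attention(queries_from, keys_values_from, W_q, W_k, W_v):
    """Q comes from one sequence; K and V come from another (or the same one)."""
    Q = queries_from @ W_q
    K = keys_values_from @ W_k
    V = keys_values_from @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))   # untrained placeholders
english = rng.normal(size=(5, d))   # 5 source tokens (encoder side)
french = rng.normal(size=(3, d))    # 3 target tokens (decoder side)

# Self-attention: the same sequence supplies Q, K, and V.
_, self_weights = attention(english, english, W_q, W_k, W_v)

# Cross-attention: French queries look at English keys and values.
_, cross_weights = attention(french, english, W_q, W_k, W_v)

print(self_weights.shape)    # (5, 5): each English token over all English tokens
print(cross_weights.shape)   # (3, 5): each French token over all English tokens
```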
Lesson 2 focuses on self-attention, multi-head attention, and causal masking — the heart of modern LLMs.
Where you meet attention in GenAI
| System | How attention shows up |
|---|---|
| GPT-style chat | Stacked causal self-attention: each token sees only itself and earlier tokens when generating |
| BERT / embeddings | Bidirectional self-attention — each token sees left and right (reading, not generating) |
| Vision models | Patch attention — image patches attend to other patches (same QKV idea) |
| RAG + LLM | Retrieval is separate; inside the LLM, attention still mixes tokens in your prompt |
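The difference between the causal and bidirectional rows comes down to a mask applied before softmax. The sketch below uses random scores as a stand-in for real query-key products and is only a preview; the helper name `softmax_rows` is made up for the example.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; entries set to -inf end up with weight exactly 0."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 4
scores = rng.normal(size=(n, n))           # placeholder scores for 4 tokens

# Bidirectional (BERT-style): every token may use every position.
bidirectional = softmax_rows(scores)

# Causal (GPT-style): position i may only use positions 0..i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
causal = softmax_rows(np.where(future, -np.inf, scores))

print(np.round(causal, 2))                 # upper triangle is all zeros
```

The first row of `causal` puts all its weight on the first token, since nothing earlier exists to attend to.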
When people say transformers “understand context,” they largely mean: attention layers rewrote each token’s vector using the other tokens in its context.
Common misconceptions
“Attention is a memory database.”
No — it is a computation that blends vectors in one forward pass. Long-term memory in chat apps comes from context window + RAG + fine-tuning, not attention alone.
“Higher attention weight = human importance.”
Weights are model-internal and useful for debugging, but they are not guaranteed to match human judgment.
“One attention layer understands the whole document.”
Usually many layers are needed. Early layers often track local grammar; deeper layers handle harder references.
“Attention replaces embeddings.”
Embeddings still map token IDs to vectors. Attention mixes those vectors using Q, K, V.
Mini recap
- Problem: RNNs compress the whole past into one vector — bad for long, selective dependencies.
- Idea: Let each position query all others and blend their values.
- Q, K, V: Three learned roles — question, label, content.
- Scores → softmax → weighted sum of Values.
- Result: Context-aware vectors that power transformers and modern GenAI.
Checkpoint
Test yourself before Lesson 2.
1. What problem does attention solve?
Answer sketch
Routing information between positions — especially distant ones — without forcing everything through a single sequential hidden state. Each output can focus on the inputs that matter for that word.
2. In the library analogy, what is the difference between Key and Value?
Answer sketch
Key is the label used for matching (compatibility). Value is the content you actually retrieve and blend when the match is strong.
3. Why use softmax on attention scores?
Answer sketch
To get non-negative weights that sum to 1 — a soft focus distribution over all positions, differentiable for training.
4. When translating English → French, does the French word’s Query attend to English Keys or French Keys?
Answer sketch
English Keys (and Values) from the encoder — cross-attention. The decoder queries the source sentence.
What's next
You know what attention does and how Q, K, V fit together. Next: every token in a sentence attends to every other token in the same sentence — self-attention, multiple heads, and the mask GPT uses when generating text.