Attention mechanism — Query, Key, Value

Before we begin

In Module 4 you saw RNNs and LSTMs read text one token at a time, carrying a single hidden state forward like a backpack. That works for short sentences — but real language needs selective focus.

When you read “The cat that chased the mouse sat on the mat”, understanding sat requires linking back to cat, not mat or mouse. The link is long-range and specific. RNNs must squeeze all of that through one vector updated at every step. Important details get diluted or forgotten.

Attention fixes this. Instead of only remembering the last hidden state, each position can look directly at every other position and ask: “Which words are relevant to me right now?”

That idea — introduced for machine translation in the 2010s and scaled up in the 2017 Transformer paper — is the core of GPT, BERT, Claude, and nearly every modern GenAI model.

What problem does attention solve? Long-range dependencies and selective routing — connecting distant words without forcing everything through a narrow sequential bottleneck.

Figure: Query, Key, Value. Score how well a Query matches each Key, then blend the corresponding Value vectors: a differentiable soft lookup.

What you will learn

State clearly what problem attention solves (and what it does not solve).
Explain Query, Key, and Value with a concrete sentence example.
Walk through the four steps of attention: Q, K, V → scores → weights → blend.
Understand why softmax turns scores into a probability-like focus pattern.
Contrast attention with the RNN hidden-state bottleneck.
Preview self-attention vs cross-attention (Lesson 2 goes deeper).

Before this lesson

Module 6 welcome
Module 4 — RNNs — hidden state and long-sequence limits
Module 4 — Word embeddings — each word is already a vector

The bottleneck RNNs create

Picture an RNN translating English → French. At step 5 it must produce a French verb. To pick the right gender and number, it may need a noun from step 1, four steps earlier.

The RNN only has h₄ (one fixed-size vector) to carry that information. Early words compete for space in the same backpack:

Problem	What happens
Compression	Many tokens must share one small vector
Distance	Information from step 1 passes through many multiplications to reach step 10
No explicit links	The model never directly says “word 5 attends to word 1” — it must hope h stores that
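
The compression and distance rows above can be made concrete with a toy recurrence. This is a minimal sketch, not a real RNN: the update rule `h = decay * h + x`, the `decay` value of 0.5, and the helper name `run_toy_rnn` are all assumptions chosen for illustration, with no learned weights and no nonlinearity.

```python
def run_toy_rnn(inputs, decay=0.5):
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t (illustrative only)."""
    hidden = [0.0] * len(inputs[0])
    for x in inputs:
        # Every step overwrites the same fixed-size state.
        hidden = [decay * h + xi for h, xi in zip(hidden, x)]
    return hidden


# Ten tokens with 2-dimensional "embeddings"; only the first token is non-zero.
tokens = [[1.0, 1.0]] + [[0.0, 0.0]] * 9
final_state = run_toy_rnn(tokens)

print(len(final_state))  # 2: the state size does not grow with sequence length
print(final_state[0])    # 0.001953125, i.e. 0.5 ** 9
```

The state never grows beyond two numbers, and by step 10 the first token's contribution has been halved nine times. A trained RNN is less uniform than this toy, but it faces the same pressure.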

LSTMs and GRUs help with vanishing gradients, but they do not remove the bottleneck: still one hidden state per step, still sequential.

Attention adds an explicit routing mechanism: at each output step, the model can pull information from any input position it needs.

What is attention? (plain language)

Attention is a way to compute a new vector for one position by taking a weighted average of vectors from other positions — where the weights come from how relevant each source is.

In one sentence:

Attention = “How much should I listen to each other word?” → blend their information accordingly.

It is soft (weights are continuous, not hard 0/1) and differentiable (gradients flow through weights, so training learns what to attend to).
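
The "weighted average" at the heart of that definition can be sketched in a few lines of plain Python. The vectors and weights below are made-up numbers, and `weighted_average` is a hypothetical helper; in real attention the weights are computed from the data rather than typed in by hand.

```python
def weighted_average(vectors, weights):
    """Blend equal-length vectors using weights that sum to 1."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


# Hypothetical 2-d vectors for three source positions.
sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]  # soft weights: continuous, non-negative, sum to 1

blended = weighted_average(sources, weights)
print(blended)  # [0.75, 0.5]
```

Because the blend is built only from multiplications and additions, a small change in any weight produces a small change in the output, which is what lets gradients flow back to whatever produced the weights.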

The library analogy (Query, Key, Value)

Think of a library search:

Role	Symbol	Library analogy	In a sentence
Query	Q	Your question: “I need info about the subject of this verb”	What the current word is looking for
Key	K	Labels on shelves: “grammar”, “subject noun”, “location”	How each other word advertises what it offers
Value	V	The actual books on the shelf	The content each word contributes if selected

Process:

Compare your Query to every shelf Key → compatibility scores.
Turn scores into weights (high match = high weight).
Read mostly the Value books from shelves with high weights.

Neural attention does the same with vectors and learned linear layers instead of humans and paper.

Important: Q, K, and V are not three copies of the word embedding. They are three different projections of the input — three “views” of the same token, learned during training so the model discovers useful roles.

Worked example: “The cat sat”

Take a tiny sentence (3 tokens). Each token already has an embedding vector from Module 4.

When we build a richer representation for sat, attention might learn patterns like:

Token	Role for understanding sat	Intuition
The	Low weight	Article — little semantic content for the verb
cat	High weight	Subject — who sat?
sat	Medium weight	Often attends to itself too

So the new vector for sat ≈ mostly cat’s information + a bit of sat + almost none of The.

Different words ask different questions:

cat might attend strongly to sat (predicate link).
The might spread weight evenly (weak, structural).

The model learns these patterns from data — we do not hand-code “verbs attend to subjects.”

The four steps (conceptual → formula)

For one Query position (e.g. the word sat):

Step 1 — Build Q, K, V

Multiply each token's input embedding by three learned matrices, one per role:

Q = “what am I looking for?”
K = “what do I offer?”
V = “what content do I pass if chosen?”

Every token gets its own Q, K, and V.
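
Step 1 can be sketched as three matrix multiplications applied to the same embedding. Everything numeric here is an assumption for illustration: the 2-dimensional embeddings, the matrices `W_q`, `W_k`, `W_v` (learned in a real model, hard-coded here), and the helper name `project`.

```python
def project(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[c] for v, row in zip(vector, matrix))
            for c in range(len(matrix[0]))]


# Hypothetical 2-d embeddings for "The", "cat", "sat".
embeddings = {"The": [0.1, 0.0], "cat": [1.0, 0.5], "sat": [0.5, 1.0]}

# Made-up 2x2 projection matrices; a real model learns these during training.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0]]
W_v = [[2.0, 0.0], [0.0, 2.0]]

queries = {tok: project(e, W_q) for tok, e in embeddings.items()}
keys = {tok: project(e, W_k) for tok, e in embeddings.items()}
values = {tok: project(e, W_v) for tok, e in embeddings.items()}

# One token, three different views.
print(queries["sat"], keys["sat"], values["sat"])  # [0.5, 1.0] [1.0, 0.5] [1.0, 2.0]
```

The same embedding for sat yields three different vectors, which is the sense in which Q, K, and V are separate projections rather than copies.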

Step 2 — Compatibility scores

Compare the Query at position i to every Key (all positions):

$\text{score}_{i,j} = Q_i \cdot K_j$

High dot product → vectors point in similar directions in learned space (and neither is close to zero length) → compatible.

In translation (cross-attention), Q often comes from the target language word being generated; K and V come from the source sentence. The French decoder queries the English encoder: “Which English words matter for this French word?”
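
The dot-product scoring in Step 2 can be sketched in plain Python. The 2-dimensional Query and Key vectors are made-up numbers rather than outputs of a trained model, and `dot` is a hypothetical helper.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical Query for "sat" and Keys for every token in "The cat sat".
query_sat = [1.0, 2.0]
keys = {"The": [0.0, 0.5], "cat": [2.0, 1.0], "sat": [1.0, 1.0]}

# One score per key: how compatible is each token with what "sat" is looking for?
scores = {token: dot(query_sat, key) for token, key in keys.items()}
print(scores)  # {'The': 1.0, 'cat': 4.0, 'sat': 3.0}
```

With these numbers cat receives the highest score, but the scores are still raw: they are not yet bounded and do not sum to anything in particular.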

Step 3 — Softmax → attention weights

Raw scores can be any real numbers. Softmax converts them to weights that are:

Positive
Sum to 1 (like a probability distribution over positions)

$\text{weight}_{i,j} = \text{softmax}_j(\text{score}_{i,j}) = \frac{\exp(\text{score}_{i,j})}{\sum_{k} \exp(\text{score}_{i,k})}$

So position i “spends” 100% of its focus budget across all keys. If cat gets 0.7, sat gets 0.25, and The gets 0.05, the focus is concentrated on cat.

Why softmax? It gives a smooth, trainable way to pick many sources with different strengths — not winner-take-all, unless the model learns very peaked scores.

Scaled dot-product: In practice, scores are divided by $\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors) so that dot products do not grow large as the dimension increases and make softmax too sharp. You will see this written as “scaled dot-product attention”: same idea, more stable training.
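
Step 3, with and without the scaling, might look like the sketch below. The raw scores and the key dimension `d_k = 2` are assumed values for illustration; the max-subtraction inside `softmax` is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [1.0, 4.0, 3.0]  # made-up scores for "The", "cat", "sat"
d_k = 2                       # assumed key dimension

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 2) for w in unscaled])
print([round(w, 2) for w in scaled])
```

With these numbers the unscaled weights come out at roughly 0.04, 0.71, and 0.26, and the scaled ones at roughly 0.07, 0.62, and 0.31. Dividing by $\sqrt{d_k}$ flattens the distribution a little; the effect is much larger when $d_k$ is in the tens or hundreds.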

Step 4 — Weighted sum of Values

$\text{output}_i = \sum_j \text{weight}_{i,j} \cdot V_j$

The new vector for sat is a blend of all Value vectors, emphasizing cat’s V.

That output replaces (or is added to) the old representation — richer context-aware meaning.
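
Putting the scoring, softmax, and blending steps together for a single Query gives the sketch below. It assumes Q, K, and V have already been built in Step 1, uses made-up 2-dimensional vectors, and the function name `attend` is hypothetical; real implementations process all positions at once with matrix operations.

```python
import math


def softmax(scores):
    """Positive weights that sum to 1 (max subtracted for numerical stability)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query position."""
    d_k = len(query)
    # Step 2: one compatibility score per key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    # Step 3: scores -> attention weights.
    weights = softmax(scores)
    # Step 4: weighted sum of the Value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical vectors, in the order "The", "cat", "sat".
keys = [[0.0, 0.5], [2.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_sat = [1.0, 2.0]

weights, output = attend(query_sat, keys, values)
print([round(w, 2) for w in weights])
print([round(x, 2) for x in output])
```

Here cat receives the largest weight, so the output for sat sits closest to cat's Value vector `[0.0, 1.0]`.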

Figure: End-to-end flow. Same diagram as above; now you know what each box means.

Attention vs one big hidden state

	RNN hidden state	Attention
Information path	Sequential — must pass through every step	Direct — any position can link to any other
Capacity	One vector for everything so far	Each position gets a custom mix
Interpretability	Opaque compressed memory	Weights can be visualized (which words linked)
Parallelism	Step t waits for t−1	All positions can be processed together (Lesson 2)

Attention does not remove the need for depth: transformers stack many attention layers so meaning builds gradually (roughly, syntax → semantics → discourse). One layer might link subject–verb; a later layer might link “it” → “cat” across a paragraph.

Self-attention vs cross-attention (preview)

Both use Q, K, V. The difference is where Q and K/V come from:

Type	Q from	K, V from	Typical use
Cross-attention	Decoder (target)	Encoder (source)	Translation: French word looks at English sentence
Self-attention	Same sequence	Same sequence	GPT/BERT: each word looks at other words in the same sentence

Lesson 2 focuses on self-attention, multi-head attention, and causal masking — the heart of modern LLMs.
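
The table's distinction can be shown with one function called two ways. This is a simplified sketch: it skips the learned Q, K, V projections and feeds the raw vectors in directly, and the "english" and "french" vectors are made-up numbers standing in for encoder and decoder states.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention: returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


english = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens (hypothetical)
french = [[0.5, 0.5], [1.0, 0.0]]               # 2 target tokens (hypothetical)

# Self-attention: Q, K, V all come from the same sequence.
self_out = attention(english, english, english)

# Cross-attention: Q from the target, K and V from the source.
cross_out = attention(french, english, english)

print(len(self_out), len(cross_out))  # 3 2
```

The computation is identical in both calls. Only the origin of the arguments changes, and the output always has one row per Query.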

Where you meet attention in GenAI

System	How attention shows up
GPT-style chat	Stacked causal self-attention: each token sees only itself and earlier tokens when generating
BERT / embeddings	Bidirectional self-attention — each token sees left and right (reading, not generating)
Vision models	Patch attention — image patches attend to other patches (same QKV idea)
RAG + LLM	Retrieval is separate; inside the LLM, attention still mixes tokens in your prompt
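
The difference between the causal and bidirectional rows comes down to which scores are allowed into the softmax. The sketch below shows one common way to express that, by setting disallowed scores to negative infinity; the scores are made-up and `causal_weights` is a hypothetical helper. Lesson 2 covers masking properly.

```python
import math


def causal_weights(scores_row, position):
    """Softmax over one row of scores, hiding positions after `position`."""
    masked = [s if j <= position else -math.inf
              for j, s in enumerate(scores_row)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


row = [1.0, 1.0, 1.0]  # equal made-up scores, so only the mask shapes the result

print(causal_weights(row, 0))  # [1.0, 0.0, 0.0]
print(causal_weights(row, 1))  # [0.5, 0.5, 0.0]
```

A bidirectional model such as BERT skips the mask, so every position in the row would receive a non-zero weight.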

When people say transformers “understand context,” they largely mean: attention layers rewrote each token’s vector using the other tokens in the context, near or far.

Common misconceptions

“Attention is a memory database.”
No — it is a computation that blends vectors in one forward pass. Long-term memory in chat apps comes from context window + RAG + fine-tuning, not attention alone.

“Higher attention weight = human importance.”
Weights are model-internal and useful for debugging, but they are not guaranteed to match human judgment.

“One attention layer understands the whole document.”
Usually many layers are needed. Early layers often track local grammar; deeper layers handle harder references.

“Attention replaces embeddings.”
Embeddings still map token IDs to vectors. Attention mixes those vectors using Q, K, V.

Mini recap

Problem: RNNs compress the whole past into one vector — bad for long, selective dependencies.
Idea: Let each position query all others and blend their values.
Q, K, V: Three learned roles — question, label, content.
Scores → softmax → weighted sum of Values.
Result: Context-aware vectors that power transformers and modern GenAI.

Checkpoint

Test yourself before Lesson 2.

1. What problem does attention solve?

Answer sketch

Routing information between positions — especially distant ones — without forcing everything through a single sequential hidden state. Each output can focus on the inputs that matter for that word.

2. In the library analogy, what is the difference between Key and Value?

Answer sketch

Key is the label used for matching (compatibility). Value is the content you actually retrieve and blend when the match is strong.

3. Why use softmax on attention scores?

Answer sketch

To get non-negative weights that sum to 1 — a soft focus distribution over all positions, differentiable for training.

4. When translating English → French, does the French word’s Query attend to English Keys or French Keys?

Answer sketch

English Keys (and Values) from the encoder — cross-attention. The decoder queries the source sentence.

What's next

You know what attention does and how Q, K, V fit together. Next: every token in a sentence attends to every other token in the same sentence — self-attention, multiple heads, and the mask GPT uses when generating text.

Lesson 2 — Self-attention & multi-head attention