Module 6 quiz and review

Before we begin

Test your understanding of attention, transformers, encoders and decoders, tokenization, and vectorization, including common interview questions on the KV cache, the O(n²) cost of attention, and encoder vs decoder models. Aim for at least 30 out of 40.

Multiple choice quiz

Pick one answer per question. Feedback appears immediately — take your time before clicking.

Question 1 of 40
What problem does attention mainly solve in sequence models?
Question 2 of 40
In attention, Query, Key, and Value vectors are used to:
Question 3 of 40
Self-attention means:
Question 4 of 40
Why do transformers use multi-head attention?
Question 5 of 40
A standard transformer block usually contains:
Question 6 of 40
Why did transformers largely replace RNNs for many NLP tasks?
Question 7 of 40
The encoder in a transformer typically:
Question 8 of 40
The decoder in GPT-style models uses causal (masked) self-attention so that:
Question 9 of 40
Original translation transformers paired encoder + decoder with:
Question 10 of 40
What is a token in an LLM pipeline?
Question 11 of 40
Subword tokenization (BPE, WordPiece) helps because:
Question 12 of 40
What is the context window?
Question 13 of 40
Softmax on attention scores ensures:
Question 14 of 40
Positional information is added in transformers because self-attention alone is:
Question 15 of 40
BERT is mainly an ___ model; GPT is mainly a ___ model.
Question 16 of 40
In the library analogy, the difference between Key and Value is:
Question 17 of 40
In cross-attention for translation, the Query for a French word in the decoder attends to:
Question 18 of 40
Why divide attention dot products by √dₖ (scaled dot-product)?
Question 19 of 40
RNN hidden state bottleneck means:
Question 20 of 40
Residual connections (skip connections) in transformer blocks help by:
Question 21 of 40
Layer normalization in transformers is typically applied to:
Question 22 of 40
Understanding “bank” in “river bank” vs “money bank” relies on:
Question 23 of 40
Autoregressive generation (GPT-style) means:
Question 24 of 40
Long documents exceeding the context window must be:
Question 25 of 40
The feed-forward network (FFN) sublayer in each transformer block:
Question 26 of 40
In a transformer, token embeddings map each token ID to:
Question 27 of 40
Positional encodings are added because plain self-attention without them is:
Question 28 of 40
Interview distinction: token embeddings inside an LLM vs sentence embeddings used for RAG retrieval:
Question 29 of 40
Self-attention over sequence length n has roughly O(n²) cost because:
Question 30 of 40
For text classification (one label per sentence), teams often used encoder-only models like BERT because:
Question 31 of 40
Interview: why not word-level tokenization for LLMs?
Question 32 of 40
Special tokens like `[PAD]` and `[EOS]` are used to:
Question 33 of 40
At inference, KV cache speeds autoregressive generation by:
Question 34 of 40
T5-style models are often described as:
Question 35 of 40
Attention weights after softmax over attended positions:
Question 36 of 40
Interview myth: “Attention is the model’s long-term memory.” Reality:
Question 37 of 40
The embedding table in a transformer has shape roughly:
Question 38 of 40
A 50-page PDF exceeds the model context window. Best first step:
Question 39 of 40
Causal masking in GPT decoders prevents the model from:
Question 40 of 40
In multi-head attention, outputs of heads are typically:

After the quiz

30/40 or higher? Start the mini transformer project.

Checklist:

I can explain what attention solves and why scaling uses √dₖ.
I can distinguish plain self-attention from causally masked self-attention.
I can name encoder vs decoder use cases (BERT vs GPT).
I know the difference between token embeddings and RAG retrieval embeddings.
I understand context window limits and chunking implications.
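
To check the first two checklist items against something concrete, here is a minimal sketch of scaled dot-product attention with an optional causal mask, written in plain Python. It assumes identity projections (the same toy vectors serve as Query, Key, and Value), where a real transformer would use learned weight matrices. The names `softmax`, `attention`, and `tokens` are illustrative and not taken from the course project.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors.

    With causal=True, position i may only attend to positions 0..i.
    Returns (outputs, weights), one row per query position.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                # Future position: a score of -inf becomes weight 0 after softmax.
                scores.append(-math.inf)
            else:
                dot = sum(a * b for a, b in zip(q, k))
                # Dividing by sqrt(d_k) keeps scores from growing with vector size.
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output is the weighted average of the Value vectors.
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy "embeddings" for three tokens (assumed data, d_k = 2).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

full_out, full_weights = attention(tokens, tokens, tokens)
causal_out, causal_weights = attention(tokens, tokens, tokens, causal=True)

for row in causal_weights:
    print([round(w, 3) for w in row])
```

With `causal=True`, the first row of weights puts all of its mass on position 0 because the later positions are masked out. Without the mask, every row spreads some weight across all three positions. The double loop over query and key positions is also where the O(n²) cost comes from.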

What's next

Project: mini transformer on blog text