Module 6 — Transformers (core of GenAI)

Module 6 quiz & review

40 interactive questions on attention, transformers, tokenization, vectorization, and interview topics (KV cache, O(n²), BERT vs GPT).

~75 min read + exercises

Before we begin

Test your understanding of attention, transformers, encoders and decoders, tokenization, and vectorization, including common interview questions on the KV cache, O(n²) attention, and encoder vs decoder models. Aim for at least 30 out of 40.


Multiple choice quiz

Pick one answer per question. Feedback appears immediately — take your time before clicking.

  1. What problem does attention mainly solve in sequence models?
  2. In attention, Query, Key, and Value vectors are used to:
  3. Self-attention means:
  4. Why do transformers use multi-head attention?
  5. A standard transformer block usually contains:
  6. Why did transformers largely replace RNNs for many NLP tasks?
  7. The encoder in a transformer typically:
  8. The decoder in GPT-style models uses causal (masked) self-attention so that:
  9. The original translation transformer paired its encoder and decoder with:
  10. What is a token in an LLM pipeline?
  11. Subword tokenization (BPE, WordPiece) helps because:
  12. What is the context window?
  13. Softmax on attention scores ensures:
  14. Positional information is added in transformers because self-attention alone is:
  15. BERT is mainly an ___ model; GPT is mainly a ___ model.
  16. In the library analogy, the difference between Key and Value is:
  17. In cross-attention for translation, the Query of a French decoder word attends to:
  18. Why does scaled dot-product attention divide the dot products by √dₖ?
  19. The RNN hidden-state bottleneck means:
  20. Residual connections (skip connections) in transformer blocks help by:
  21. Layer normalization in transformers is typically applied to:
  22. Understanding “bank” in “river bank” vs “money bank” relies on:
  23. Autoregressive generation (GPT-style) means:
  24. Long documents exceeding the context window must be:
  25. The feed-forward network (FFN) sublayer in each transformer block:
  26. In a transformer, token embeddings map each token ID to:
  27. Positional encodings are added because plain self-attention without them is:
  28. Interview distinction: in-LLM token embeddings vs RAG sentence embeddings:
  29. Self-attention over sequence length n has roughly O(n²) cost because:
  30. For text classification (one label per sentence), teams often used encoder-only models like BERT because:
  31. Interview: why not use word-level tokenization for LLMs?
  32. Special tokens like `[PAD]` and `[EOS]` are used to:
  33. At inference, the KV cache speeds up autoregressive generation by:
  34. T5-style models are often described as:
  35. Attention weights after softmax over attended positions:
  36. Interview myth: “Attention is the model’s long-term memory.” Reality:
  37. The embedding table in a transformer has shape roughly:
  38. A 50-page PDF exceeds the model’s context window. Best first step:
  39. Causal masking in GPT decoders prevents the model from:
  40. In multi-head attention, the outputs of the heads are typically:

After the quiz

30/40 or higher? Start the mini transformer project.

Checklist:

  • I can explain what attention solves and why scaling uses √dₖ.
  • I can distinguish full (bidirectional) self-attention from causal (masked) self-attention.
  • I can name encoder vs decoder use cases (BERT vs GPT).
  • I can distinguish in-model token embeddings from the sentence embeddings used for RAG retrieval.
  • I understand context window limits and chunking implications.
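
To check the first two items against something concrete, here is a minimal sketch of scaled dot-product attention with an optional causal mask, written in plain Python. It is an illustration under stated assumptions, not how any particular library implements attention: the function names `softmax` and `attention` and the toy matrix `X` are made up for this example, and the same vectors are reused as Query, Key, and Value, whereas a real transformer would first apply learned projection matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over lists of vectors.

    Q and K hold n vectors of length d_k; V holds n vectors.
    Returns (outputs, weights), where weights is an n x n table.
    """
    n = len(K)
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # With causal masking, position i only sees positions 0..i.
        visible = range(i + 1) if causal else range(n)
        # Dividing by sqrt(d_k) keeps scores from growing with dimension,
        # which would otherwise push softmax toward a one-hot output.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in visible
        ]
        row = softmax(scores)
        row = row + [0.0] * (n - len(row))  # masked positions get weight 0
        out = [sum(row[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        weights.append(row)
        outputs.append(out)
    return outputs, weights

# Toy input (hypothetical): 3 tokens, 2 dimensions, no learned projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

full_out, full_w = attention(X, X, X)                   # every token sees every token
causal_out, causal_w = attention(X, X, X, causal=True)  # GPT-style masking

print(causal_w[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```

Two things to notice: each row of the weight table sums to 1, and the table has n × n entries, which is where the roughly O(n²) cost of self-attention comes from.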

What's next

Project: mini transformer on blog text