Question 1 of 40
What problem does attention mainly solve in sequence models?
A. Removing the need for any labeled training data in supervised fine-tuning
B. Letting each position directly look at and weight all other relevant positions
C. Converting raw image pixels into natural-language captions automatically
D. Replacing loss functions with raw classification accuracy as the only metric
Question 2 of 40
In attention, Query, Key, and Value vectors are used to:
A. Shuffle the training dataset randomly at the start of every epoch
B. Compress image files into a smaller storage format before tokenization
C. Score compatibility with Q·K, then blend Value vectors weighted by those scores
D. Store the optimizer learning rate schedule across training epochs
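Worked example: a minimal sketch of the Query/Key/Value mechanics for a single query, in plain Python. The vectors are made-up toy numbers, and the helper names `softmax` and `attend` are hypothetical; a real model produces Q, K, and V from learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score the query against every key, then blend the value vectors."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Toy, hand-picked vectors (assumption: not taken from any real model).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, blended = attend(query, keys, values)
print(weights)  # the key most aligned with the query gets the largest weight
print(blended)  # weighted mix of the value vectors
```

The key that points in the same direction as the query receives the largest weight, so its value dominates the blended output.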
Question 3 of 40
Self-attention means:
A. Attention weights are computed using only the final token in the sequence
B. The model inspects only its own weight matrices, ignoring the input tokens
C. Training proceeds without any labels by using only unlabeled raw text
D. Positions within the same input sequence attend to one another
Question 4 of 40
Why do transformers use multi-head attention?
A. To capture several parallel attention patterns (syntax, coreference, etc.) in one layer
B. To require exactly one GPU per attention head during distributed training
C. To eliminate token embeddings by operating directly on raw character bytes
D. Each head must tokenize the input with a completely separate vocabulary
Question 5 of 40
A standard transformer block usually contains:
A. Convolution layers only; self-attention is not part of the standard block
B. Pooling and flatten operations that output one vector per input image
C. Multi-head self-attention, feed-forward network, residual links, and layer norm
D. A single unidirectional LSTM cell with no attention or normalization layers
Question 6 of 40
Why did transformers largely replace RNNs for many NLP tasks?
A. RNNs always achieved higher accuracy on every NLP benchmark without exception
B. They never require GPUs because all attention runs efficiently on CPU only
C. They cannot perform machine translation between any pair of languages
D. Parallel training over sequence length and stronger long-range token connections
Question 7 of 40
The encoder in a transformer typically:
A. Reads the full input sequence and builds contextual token representations
B. Generates output tokens one at a time with no access to any source context
C. Tokenizes raw bytes using fixed rules without any learned representations
D. Runs only during inference; it is skipped entirely during model training
Question 8 of 40
The decoder in GPT-style models uses causal (masked) self-attention so that:
A. Training labels are removed so the model learns without supervision signals
B. Each token can attend only to previous tokens, not to any future positions
C. The model outputs image tensors instead of discrete token probability vectors
D. Every token can see the entire future paragraph during next-token training
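Worked example: a minimal sketch of a causal mask for a 4-token sequence. The helper names `causal_mask` and `apply_mask` are hypothetical, and the uniform scores are placeholders; the point is only which positions stay visible.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def apply_mask(scores, mask):
    """Set disallowed scores to -inf so a later softmax gives them zero weight."""
    return [
        [s if allowed else float("-inf") for s, allowed in zip(row, mask_row)]
        for row, mask_row in zip(scores, mask)
    ]

mask = causal_mask(4)
scores = [[0.5] * 4 for _ in range(4)]  # placeholder attention scores
masked = apply_mask(scores, mask)

for row in masked:
    print(row)  # row i keeps entries 0..i and blanks out the rest
```

Row 0 sees only itself, while the last row sees the whole prefix; no row sees a later position.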
Question 9 of 40
Original translation transformers paired encoder + decoder with:
A. No attention mechanism linking source tokens to generated target tokens
B. Two decoder stacks only; no encoder processes the source sentence at all
C. A CNN image encoder paired with an RNN decoder, with no transformer attention
D. Encoder reads source; decoder generates target with cross-attention to encoder
Question 10 of 40
What is a token in an LLM pipeline?
A. A text piece (word, subword, or byte) mapped to an integer ID in the vocab
B. A physical memory address on the GPU where one embedding vector is stored
C. Always exactly one full English word; subwords are never used in practice
D. The scalar learning rate value applied by the optimizer at training step t
Question 11 of 40
Subword tokenization (BPE, WordPiece) helps because:
A. It always makes every input sequence shorter regardless of language or text
B. Rare words split into known pieces, giving a smaller vocab and fewer unknown tokens
C. It eliminates the need for learned embeddings by using ASCII codes directly
D. It permanently removes all punctuation so models never see comma or period
Question 12 of 40
What is the context window?
A. Fraction of the dataset reserved for validation during train/val splitting
B. The pixel width of the browser tab displaying the chat interface
C. Maximum number of tokens the model can process in one forward pass
D. Total number of training epochs before the learning rate reaches zero
Question 13 of 40
Softmax on attention scores ensures:
A. Attention weights are positive and sum to 1 over the attended positions
B. All attention weights are forced to exactly zero before blending values
C. Gradients cannot flow backward through the attention computation at all
D. Only the first token in the sequence receives any non-zero weight
Question 14 of 40
Positional information is added in transformers because self-attention alone is:
A. Designed exclusively for image pixels; text order is handled by CNNs
B. Permutation-invariant; token order would be lost without position signals
C. Replaced entirely by batch normalization across the sequence dimension
D. Already fully aware of word order without any positional encoding added
Question 15 of 40
BERT is mainly an ___ model; GPT is mainly a ___ model.
A. Unsupervised-only architecture; supervised-only architecture with no pretrain
B. Decoder (left-to-right only); encoder (full-sequence bidirectional reading)
C. Convolutional image network; recurrent sequence model without attention
D. Encoder (bidirectional context); decoder (causal / autoregressive generation)
Question 16 of 40
In the library analogy, the difference between Key and Value is:
A. Key is the final layer output; Value is the cross-entropy loss scalar
B. They are always identical vectors; attention never separates their roles
C. Key is for matching compatibility; Value is the content blended when matched
D. Value vectors are used only inside CNNs, never in transformer attention
Question 17 of 40
In cross-attention for English-to-French translation, a French decoder word's Query attends to:
A. Raw image pixel intensities from a parallel vision encoder branch
B. English encoder Keys and Values from the full source sentence representation
C. Random noise vectors injected for regularization during training only
D. Only future French target tokens that have not yet been generated
Question 18 of 40
Why divide attention dot products by √dₖ (scaled dot-product)?
A. Scaling is required only when training CNNs, not transformer attention
B. To convert token embeddings into RGB image tensors for a vision head
C. To remove the need for softmax entirely in the attention weighting step
D. Keeps score magnitudes stable so softmax does not become excessively sharp
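Worked example: a small sketch of how dividing by √dₖ changes the softmax output. The raw scores and `d_k = 64` are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                        # assumed key dimension
raw_scores = [16.0, 8.0, 0.0]   # made-up unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # divides each by 8

sharp = softmax(raw_scores)      # nearly all weight lands on the first position
smooth = softmax(scaled_scores)  # weight is spread across positions

print(sharp)
print(smooth)
```

Without scaling, one position takes almost all of the weight; after scaling, the other positions still contribute, which keeps gradients through the softmax from vanishing.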
Question 19 of 40
RNN hidden state bottleneck means:
A. RNNs cannot run on GPUs because recurrence is inherently sequential only
B. Attention layers cannot be trained with backpropagation on long sequences
C. All past tokens must compress into one fixed-size vector carried forward
D. Each token automatically gets unlimited separate memory with no compression
Question 20 of 40
Residual connections (skip connections) in transformer blocks help by:
A. Letting gradients flow and stabilizing training in deep multi-layer stacks
B. Doubling the vocabulary size by concatenating two tokenizers together
C. Making inference impossible because outputs must always bypass all layers
D. Removing the need for multi-head self-attention in every transformer block
Question 21 of 40
Layer normalization in transformers is typically applied to:
A. Delete positional embeddings so the model becomes order-invariant
B. Replace the tokenization step entirely before embeddings are looked up
C. Stabilize activations within each layer (often applied before or after each sublayer)
D. Only the final cross-entropy loss scalar at the end of training
Question 22 of 40
Understanding “bank” in “river bank” vs “money bank” relies on:
A. Context from neighboring words blended via self-attention at each position
B. Ignoring all other tokens so each word keeps a single fixed meaning
C. Removing embeddings and using only the first letter of each token
D. Only the first letter of the ambiguous word; context is not used
Question 23 of 40
Autoregressive generation (GPT-style) means:
A. The model never uses causal masking during next-token prediction training
B. Only BERT-style encoder models can generate text autoregressively
C. All output tokens are generated in parallel with no left-to-right ordering
D. Each new token is predicted from all previously generated tokens in the prefix
Question 24 of 40
Long documents exceeding the context window must be:
A. Ignored automatically by every LLM without any preprocessing strategy
B. Truncated, chunked, or summarized; the model cannot attend beyond the window
C. Fed in one shot with no length limit because windows grow at inference time
D. Converted to grayscale images before the tokenizer can process them
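Worked example: one possible way to chunk a long token sequence so each piece fits a window. The function name `chunk_tokens`, the window of 4, and the overlap of 1 are illustrative assumptions; real pipelines pick these values per model and task.

```python
def chunk_tokens(token_ids, window, overlap=0):
    """Split token IDs into pieces of at most `window` tokens.

    Consecutive chunks share `overlap` tokens so context is not cut abruptly.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last chunk already reaches the end of the sequence
    return chunks

token_ids = list(range(10))  # stand-in for a tokenized document
chunks = chunk_tokens(token_ids, window=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk can then be processed in its own forward pass, or summarized and recombined.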
Question 25 of 40
The feed-forward network (FFN) sublayer in each transformer block:
A. Processes each token position independently after attention has mixed context
B. Replaces self-attention entirely so no token ever sees other positions
C. Tokenizes raw byte streams before integer IDs are passed to embeddings
D. Runs only on the held-out validation split, never on training batches
Question 26 of 40
In a transformer, token embeddings map each token ID to:
A. The final softmax probability over the entire vocabulary
B. A PNG image thumbnail stored beside each word in the vocab
C. A dense vector looked up from an embedding table before attention runs
D. A raw ASCII character code with no learned representation
Question 27 of 40
Positional encodings are added because plain self-attention without them is:
A. Only used in CNNs, never in language transformers
B. Unable to run on GPUs during training or inference
C. Already aware of grammar without any position signal
D. Order-blind: “cat sat” and “sat cat” would look the same
Question 28 of 40
Interview distinction: in-LLM token embeddings vs RAG sentence embeddings:
A. They are always the same weights stored in one shared table
B. Token embeddings feed generation inside the model; RAG embeddings search documents
C. RAG embeddings replace tokenization entirely in GPT models
D. Token embeddings are only for images; RAG only for audio
Question 29 of 40
Self-attention over sequence length n has roughly O(n²) cost because:
A. Each token pair can be scored: n positions attend to n positions
B. FFN layers always multiply n by the number of training epochs
C. Tokenization splits every word into exactly two subwords
D. The vocabulary size squares on every forward pass
Question 30 of 40
For text classification (one label per sentence), teams often used encoder-only models like BERT because:
A. Classification requires autoregressive next-token sampling only
B. Decoder-only models cannot read more than one token total
C. Encoders generate long stories left-to-right with causal masking
D. Bidirectional context helps understand the full sentence before classification
Question 31 of 40
Interview: why not word-level tokenization for LLMs?
A. Words are always exactly 4 characters in every language
B. Subword methods are banned in modern transformer training
C. Huge vocabularies and many unknown words for rare or new terms
D. Word-level always produces shorter sequences than byte-level
Question 32 of 40
Special tokens like `[PAD]` and `[EOS]` are used to:
A. Store user passwords inside the tokenizer vocabulary file
B. Pad batches to equal length and mark sequence start/end for the model
C. Double the learning rate on the final transformer layer only
D. Replace the optimizer during fine-tuning on labeled pairs
Question 33 of 40
At inference, a KV cache speeds up autoregressive generation by:
A. Reusing stored key/value vectors from prior tokens instead of recomputing them
B. Replacing attention with a fixed CNN on every step
C. Deleting the vocabulary so only one token can ever be output
D. Training new weights after every generated token
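Worked example: a toy sketch of the caching idea, assuming each generation step yields one (query, key, value) triple. The `KVCache` class, `attend_step` helper, and the hand-written vectors are hypothetical; a real model derives these vectors from learned projections and keeps one cache per layer and head.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class KVCache:
    """Stores keys and values of already-processed tokens."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def attend_step(query, cache):
    """Attend from the newest token over every cached key/value pair."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in cache.keys
    ]
    weights = softmax(scores)
    dim = len(cache.values[0])
    return [sum(w * v[i] for w, v in zip(weights, cache.values)) for i in range(dim)]

# Made-up per-step (query, key, value) triples.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
outputs = []
for query, key, value in steps:
    cache.append(key, value)  # only the new token's key/value are added
    outputs.append(attend_step(query, cache))  # earlier keys/values are reused

print(len(cache.keys))  # 3
```

Each step adds one key/value pair and reuses the rest, so earlier tokens are never re-projected.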
Question 34 of 40
T5-style models are often described as:
A. Pure CNN image classifiers with no language component
B. RNN-only models that cannot use attention layers
C. Diffusion models for generating images from noise
D. Encoder-decoder transformers framing many tasks as text-to-text
Question 35 of 40
Attention weights after softmax over attended positions:
A. Are stored permanently as long-term user memory in the cloud
B. Replace backpropagation so gradients never flow backward
C. Are non-negative and sum to 1, giving a weighted blend of Value vectors
D. Must always be exactly zero for every position
Question 36 of 40
Interview myth: “Attention is the model’s long-term memory.” Reality:
A. Attention removes the need for any external database in all apps
B. Attention mixes tokens in one forward pass; persistent memory needs context, RAG, or storage
C. Attention permanently writes every user message into model weights at inference
D. Attention only works on the validation split, never on training data
Question 37 of 40
The embedding table in a transformer has shape roughly:
A. [vocabulary_size × embedding_dim], one row per token ID
B. [batch_size × image_height × image_width × 3]
C. [number_of_layers × learning_rate × epoch_count]
D. [context_window × temperature × top_p]
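Worked example: a minimal sketch of an embedding table as a list of rows indexed by token ID. The sizes (8 tokens, 4 dimensions), the random initialization, and the `embed` helper are toy assumptions; real tables are learned and far larger.

```python
import random

vocabulary_size = 8  # toy value
embedding_dim = 4    # toy value

rng = random.Random(0)  # fixed seed so the sketch is repeatable
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocabulary_size)
]

def embed(token_ids):
    """Look up one row of the table per token ID."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([3, 0, 3])
print(len(embedding_table), len(embedding_table[0]))  # 8 4
print(len(vectors))                                   # 3
```

The same token ID always maps to the same row, so both occurrences of ID 3 get identical vectors before attention mixes in context.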
Question 38 of 40
A 50-page PDF exceeds the model context window. Best first step:
A. Convert the PDF to audio so tokenization is skipped
B. Retrain the base model from scratch on that one PDF only
C. Paste all pages anyway; windows grow automatically at runtime
D. Chunk or summarize; the model cannot attend beyond the window in one pass
Question 39 of 40
Causal masking in GPT decoders prevents the model from:
A. Generating more than one token per user session
B. Looking at future tokens when predicting the next token at position t
C. Using GPU acceleration during batched inference
D. Attending to any previous tokens in the prefix
Question 40 of 40
In multi-head attention, the outputs of the heads are typically:
A. Discarded so only the first head affects the next layer
B. Averaged into a single scalar loss with no projection
C. Concatenated and linearly projected back to model dimension
D. Written to disk as JPEG files for visualization only
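Worked example: a small sketch of combining head outputs for one token position, assuming two heads of width 2 and a model dimension of 4. The helper names and the hand-written output projection matrix are hypothetical; in a real model the projection is learned.

```python
def concat_heads(head_outputs):
    """Join the per-head output vectors end to end."""
    return [x for head in head_outputs for x in head]

def linear(vector, weight):
    """Multiply by a weight matrix stored as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

# Made-up outputs of two attention heads for a single token position.
head_outputs = [[1.0, 2.0], [3.0, 4.0]]

# Hand-written 4 x 4 output projection (assumption, not learned weights).
output_projection = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]

concatenated = concat_heads(head_outputs)
projected = linear(concatenated, output_projection)
print(concatenated)  # [1.0, 2.0, 3.0, 4.0]
print(projected)     # [1.0, 2.0, 3.0, 10.0]
```

The projection lets information from different heads mix before the result moves on to the next sublayer.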