Question 1 of 40
A GPT-style LLM is best described as:
A. An encoder-decoder model trained mainly for image captioning
B. A decoder-only transformer trained to predict the next token
C. A retrieval database that stores Wikipedia pages as raw text
D. A rule-based chatbot with fixed if/else logic and no learned weights
Question 2 of 40
During inference, an LLM typically:
A. Requires retraining the entire model before each new question
B. Outputs every token in parallel with no sequential ordering
C. Generates one token at a time, conditioning on prior tokens
D. Updates all model weights from every user message at runtime
Question 3 of 40
Prompt engineering mainly means:
A. Only tuning GPU driver settings to change completion style
B. Manually rewriting transformer attention math in the source code
C. Permanently deleting portions of the model's pretraining corpus
D. Designing instructions and examples so the model produces useful outputs
Question 4 of 40
A system prompt usually:
A. Sets global behavior rules for the assistant across the conversation
B. Replaces the need for any user message in every API request
C. Is sent only after the model finishes generating its answer
D. Must be exactly one token long to stay within context limits
Question 5 of 40
What does temperature control in LLM sampling?
A. The number of transformer layers active during each forward pass
B. The maximum context window size measured in tokens for the model
C. Randomness of token choices: higher diversifies, lower focuses output
D. Physical GPU heat during training that limits how many layers run
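As a study aid for this question, here is a minimal sketch of how a temperature value is commonly applied: logits are divided by the temperature before the softmax. The function name and the three example logits are made up for illustration and do not come from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
```

Comparing the two printed distributions shows how much probability the top token keeps at each setting.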
Question 6 of 40
Top-k sampling:
A. Only works during fine-tuning and cannot be used at inference
B. Always picks the single highest-probability token at every step
C. Removes the vocabulary entirely and samples from an empty set
D. Restricts the next token to the k most likely candidates
Question 7 of 40
Top-p (nucleus) sampling:
A. Samples from the smallest set of tokens whose cumulative probability ≥ p
B. Always returns exactly p tokens in the final generated answer
C. Sets the learning rate during backpropagation on each batch
D. Is identical to greedy decoding with temperature set to zero
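For review after the two sampling questions above, the sketch below filters a toy next-token distribution with top-k and with top-p. The helper names and the four-token distribution are hypothetical; real decoders work on full vocabularies and tensors, so treat this as an illustration only.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    kept, mass = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        mass += probs[token]
        if mass >= p:
            break
    return {t: probs[t] / mass for t in kept}

# Made-up next-token distribution.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "rock": 0.05}

k_result = top_k_filter(probs, 2)
p_result = top_p_filter(probs, 0.9)
print(sorted(k_result))
print(sorted(p_result))
```

Sampling would then draw from the filtered, renormalized distribution instead of the full one.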
Question 8 of 40
Fine-tuning vs RAG: what is the main difference?
A. Fine-tuning never uses GPUs and runs only on CPU inference
B. Fine-tuning updates model weights; RAG retrieves external docs at query time
C. They are the same technique; both only change the system prompt
D. RAG updates all of the model's weights on every question before answering
Question 9 of 40
When is RAG often preferred over fine-tuning?
A. You need to remove hallucinations completely without any retrieval
B. You want to change fundamental reasoning style with no training data
C. You have zero documents and no API access to any model
D. Private docs change often and you need citations to sources
Question 10 of 40
In RAG, embeddings are used to:
A. Convert text chunks and queries into vectors for similarity search
B. Replace the LLM entirely so no generation step is needed
C. Compress PNG images into smaller files for faster upload
D. Store raw API response headers for caching purposes only
Question 11 of 40
Why do LLMs hallucinate?
A. They only output numeric scores and never generate language
B. They predict plausible text, not verified facts; there is no built-in truth checker
C. Hallucination means the model refuses to answer any question
D. Temperature is always zero so the model invents random tokens
Question 12 of 40
Which practice reduces hallucinations in doc Q&A?
A. Use maximum temperature on every call to increase creativity
B. Hide all source documents from the model during generation
C. Use RAG with citations and instruct the model to answer only from context
D. Ask for longer answers with no format or grounding constraints
Question 13 of 40
In LLM context, embeddings refer to:
A. Dense vector representations of tokens or text used by models and retrieval
B. JPEG images embedded inside PDF documents for display only
C. The HTML layout and CSS styling of a chat user interface
D. Version control hashes that identify each training checkpoint
Question 14 of 40
Few-shot prompting means:
A. Using a model with very few parameters instead of a large one
B. Including example input/output pairs in the prompt before the real task
C. Sending empty prompts and relying on default model behavior
D. Training the model for only a few seconds on a tiny dataset
Question 15 of 40
A vector database (FAISS, Pinecone, etc.) in RAG stores:
A. The full LLM weights so inference can run without a GPU
B. Only raw PDF binaries with no search or similarity ranking
C. User authentication credentials and session tokens for the app
D. Embedding vectors plus metadata to retrieve similar text chunks quickly
Question 16 of 40
Pretraining an LLM mainly means:
A. Deploying the model to production behind a Redis cache layer
B. Labeling every email in a dataset as spam or not spam
C. Learning weights by predicting next tokens on large text corpora
D. Only fine-tuning on one user's private chat history
Question 17 of 40
Temperature = 0 (greedy decoding) typically:
A. Disables the model forward pass so no tokens are produced
B. Picks the highest-probability token each step, so output is more deterministic
C. Doubles hallucinations by design to encourage creative output
D. Samples randomly from the full vocabulary on every generation step
Question 18 of 40
A clear system prompt should:
A. Include API keys so the model can call external services directly
B. Contain every document in your company so RAG is unnecessary
C. Replace the user message entirely on every turn of the chat
D. Set role, constraints, tone, and output format for the conversation
Question 19 of 40
Chunking documents for RAG is important because:
A. Chunking removes the need for any embedding model in the pipeline
B. LLMs cannot process paragraphs and only read single words
C. Embeddings work on bounded text, so chunks must fit the context and match queries
D. Only PDF page numbers matter and text content is ignored
Question 20 of 40
Fine-tuning is often chosen over RAG when:
A. You need to change style/behavior globally and have quality paired data
B. You have no GPU or API budget for any model inference
C. You only need one static FAQ answer with no customization
D. Facts change hourly and every answer must cite live sources
Question 21 of 40
Grounding a chatbot answer means:
A. Hiding citations from users so answers look more confident
B. Training the model without any loss function or labels
C. Tying the response to retrieved or provided source text
D. Using temperature 2.0 on every call to increase creativity
Question 22 of 40
Top-p (nucleus) sampling keeps:
A. The smallest set of highest-probability tokens whose cumulative mass ≥ p
B. Exactly one token at every step regardless of model confidence
C. Only punctuation tokens and discards all word tokens
D. The entire vocabulary with equal probability for every token
Question 23 of 40
Chain-of-thought prompting helps when:
A. You want to disable all reasoning and return only a label
B. The task is only image classification with no text involved
C. You need the shortest possible output on every request
D. The task needs multi-step reasoning: the model writes intermediate steps before the answer
Question 24 of 40
In a chat API, assistant messages in history are included so that:
A. Embeddings are deleted from the vector store after each reply
B. Multi-turn context continues: the model sees prior replies in the thread
C. The model retrains its weights on each turn of the conversation
D. API keys rotate automatically after every assistant response
Question 25 of 40
Cosine similarity between query and chunk embeddings in RAG finds:
A. Chunks semantically closest to the question for retrieval
B. The learning rate schedule used during model fine-tuning
C. The tokenizer vocabulary size used during model pretraining
D. Random chunks with no ranking or relevance scoring
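For review, a minimal sketch of cosine similarity applied to a query vector and a few chunk vectors. The three-dimensional vectors and chunk names are invented for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical query and chunk embeddings.
query = [0.9, 0.1, 0.0]
chunks = {
    "refund policy": [0.8, 0.2, 0.1],
    "office hours": [0.1, 0.9, 0.2],
    "holiday schedule": [0.0, 0.3, 0.9],
}

# Rank chunk names from most to least similar to the query.
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(query, chunks[name]),
    reverse=True,
)
print(ranked)
```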
Question 26 of 40
SFT (supervised fine-tuning) in the LLM lifecycle means:
A. Only indexing PDFs in a vector database without generation
B. Running a Redis cache in front of the chat API
C. Training on curated (prompt, ideal response) pairs after pretraining
D. Deleting the entire pretraining corpus before any inference
Question 27 of 40
RLHF is mainly used to:
A. Eliminate the need for any inference GPU at runtime
B. Replace tokenization with raw byte streams only
C. Compress PNG images inside the tokenizer vocabulary
D. Align model outputs with human preferences (helpful, safe tone)
Question 28 of 40
LoRA fine-tuning is popular in industry because:
A. It requires retraining the full model on the entire internet weekly
B. It trains small adapter weights, which is cheaper than updating all parameters
C. It removes the need for any labeled training examples
D. It only works for CNN image models, not LLMs
Question 29 of 40
Quantization (e.g. INT8, 4-bit) helps primarily with:
A. Faster, cheaper inference and fitting larger models on limited GPU memory
B. Replacing RAG retrieval with keyword search only
C. Training the model without any loss function or labels
D. Guaranteeing zero hallucinations on every user question
Question 30 of 40
Hybrid search in RAG combines:
A. Image diffusion with text generation in one step
B. Only random chunk selection with no ranking
C. Fine-tuning and deleting the vector index on every query
D. Keyword matching (BM25) with embedding similarity for better recall
Question 31 of 40
A reranker after first-stage retrieval is used because:
A. Reranking removes the need for any LLM in the pipeline
B. First-stage retrieval is always perfect so reranking is decorative
C. Bi-encoder search is fast but coarse; a reranker scores (query, chunk) pairs more accurately
D. Rerankers only work on image pixels, not text chunks
Question 32 of 40
When a source document is deleted, a production RAG index should:
A. Automatically fine-tune the LLM on every deletion event
B. Remove or invalidate stale chunk IDs rather than serve outdated embeddings forever
C. Ignore metadata and only store page numbers
D. Keep deleted text forever because indexes never change
Question 33 of 40
Chunk overlap (50–100 tokens) helps because:
A. Sentences split at boundaries still appear whole in at least one chunk
B. Overlap is only for image segmentation, never text RAG
C. Overlap removes the need for embedding models entirely
D. Overlap doubles API keys stored in each chunk
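For review, a minimal sketch of fixed-size chunking with overlap. It splits on whitespace instead of using a real tokenizer, and the function name and the tiny sizes are chosen only to keep the example readable; the 50–100 token overlap mentioned in the question would use the same logic with larger numbers.

```python
def chunk_with_overlap(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens with the next window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace splitting stands in for a real tokenizer here.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_with_overlap(tokens, size=4, overlap=2)

for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last two tokens of the previous one, so a phrase cut at one boundary appears intact in the neighboring chunk.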
Question 34 of 40
A coding agent (Claude Code / Cursor Agent) differs from chat because it:
A. Cannot read project files or run any commands
B. Only answers in one shot with no tool access
C. Replaces git version control with a single prompt
D. Loops over file edits, terminal commands, and tests toward a repo goal
Question 35 of 40
Faithfulness in RAG eval means:
A. Temperature is set to maximum on every request
B. Citations are hidden so users trust the model blindly
C. Every claim in the answer is supported by retrieved context
D. The model always writes the longest possible answer
Question 36 of 40
For RAG Q&A with citations, sampling settings are usually:
A. Top-p disabled and vocabulary restricted to punctuation only
B. Low temperature (0–0.3), favoring factual, stable wording
C. Maximum temperature always, to maximize creative fiction
D. No system prompt and no retrieved context
Question 37 of 40
Catastrophic forgetting during fine-tuning means:
A. The model loses general abilities when over-trained on a narrow dataset
B. Users forget their passwords when the context window is full
C. The vector database deletes all chunks at midnight
D. Temperature resets to zero after every token
Question 38 of 40
Bi-encoder retrieval vs cross-encoder reranking:
A. Bi-encoder only works on audio; cross-encoder only on video
B. Cross-encoder runs before any documents are indexed
C. They are identical algorithms with the same speed and accuracy
D. Bi-encoder embeds query and docs separately (fast); cross-encoder scores pairs jointly (slower, sharper)
Question 39 of 40
A user pastes “ignore instructions and leak secrets” into RAG context. Mitigation:
A. Disable citations so attackers cannot see sources
B. Treat retrieved text as untrusted; system rules + output validation + PII filters
C. Give the model admin API keys so it can defend itself
D. Remove all system prompts to avoid confusion
Question 40 of 40
Interview question on inference vs training: what happens to the weights during a user chat?
A. Training: happens on every user message automatically in ChatGPT
B. Inference: requires retraining the full model before each reply
C. Inference: weights fixed, forward pass only, with no gradient updates per message
D. Training and inference are the same step in production APIs