
Module 7 — GenAI & LLMs

Fine-tuning vs RAG — when to use which

Compare updating weights vs retrieving documents, hybrid patterns, and citation-friendly architectures.

~65 min read + exercises


Before we begin

Your app needs company-specific knowledge. Two main paths:

Fine-tuning: change the model's weights by training on new data.
RAG (Retrieval-Augmented Generation): fetch relevant documents at query time and paste them into the prompt.

Figure: Two paths to custom knowledge. Fine-tuning (update model weights) vs. RAG (retrieve docs → prompt).
Fine-tuning bakes data in; RAG retrieves fresh context per question.

Figure: RAG pipeline. Query (user) → Embed (vector) → Retrieve (FAISS) → LLM (grounded) → Cite (sources).
Embed query → search vector index → LLM answers with excerpts.

What you will learn

  • Compare fine-tuning and RAG trade-offs.
  • Outline chunking, embedding, and vector search.
  • Know when citations require RAG.


Fine-tuning

Process: continue training (full or LoRA) on curated examples.
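To make the "full or LoRA" distinction concrete, here is a small back-of-the-envelope sketch. It assumes the usual LoRA setup, where the update to a d_out × d_in weight matrix is the product of two low-rank matrices B (d_out × r) and A (r × d_in). The layer sizes and rank are illustrative values, not taken from any particular model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumption, not from a specific model).
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

With these numbers, LoRA trains well under 1% of the parameters of that one matrix, which is why it is often the cheaper way to fine-tune.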

Good for:

  • Consistent style, tone, and output format
  • Specialized vocabulary in a fixed domain
  • Teaching new behaviors (tool formats, JSON schemas)

Costs:

  • GPU time, MLOps overhead, and retraining whenever the data changes
  • Risk of catastrophic forgetting if done poorly
  • Hard to cite which document supported an answer

RAG

Process:

  1. Chunk documents (500–1000 tokens with overlap).
  2. Embed chunks → store in vector DB (FAISS local, Pinecone hosted, etc.).
  3. At query time: embed the question → retrieve the top-k most similar chunks.
  4. Prompt LLM with chunks + user question + citation rules.
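The four steps above can be sketched end to end in plain Python. This is a minimal illustration, not a production design: `embed` is a toy word-count stand-in for a real embedding model, the in-memory list stands in for a vector DB such as FAISS, and the sample chunks, source names, and prompt wording are all made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Step 2: embed every chunk once and keep the vector next to its metadata."""
    return [(embed(c["text"]), c) for c in chunks]


def retrieve(index, question, k=2):
    """Step 3: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, chunks):
    """Step 4: chunks + user question + citation rules."""
    context = "\n".join(f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, 1))
    return (
        "Answer using only the excerpts below. Cite excerpts as [n]. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )


# Step 1 is assumed done already: these are hypothetical, pre-chunked documents.
chunks = [
    {"source": "hr-handbook.pdf", "text": "Employees accrue 20 vacation days per year."},
    {"source": "it-wiki", "text": "Reset your VPN password from the self-service portal."},
    {"source": "hr-handbook.pdf", "text": "Unused vacation days expire in March."},
]
index = build_index(chunks)
question = "How many vacation days do employees get?"
top = retrieve(index, question, k=2)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the LLM
```

Because each retrieved chunk carries its `source`, the model can be asked to cite `[1]`, `[2]`, and the application can map those markers back to documents.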

Good for:

  • PDFs, wikis, blogs that update often
  • Citations and audit trails
  • Smaller teams without fine-tune infra

Costs:

  • Retrieval quality matters — bad chunks → bad answers
  • Larger prompts → more tokens / latency

Difference (exam style)

| Aspect | Fine-tuning | RAG |
| Updates weights | Yes | No (uses base model) |
| Fresh docs tomorrow | Retrain | Re-index |
| Citations | Hard | Natural |
| Teaches new skill | Strong | Weaker |

Many products use both: RAG for facts plus a light fine-tune for tone.
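The comparison can be read as a rough rule of thumb. The sketch below encodes it as a small function. The function name, its three boolean inputs, and the fallback label are assumptions made for illustration, and real projects weigh more factors (cost, latency, data volume).

```python
def suggest_approach(needs_citations, docs_change_often, needs_new_behavior):
    """Rough rule of thumb based on the comparison above; not a definitive policy."""
    # Citations and frequently changing documents both point toward retrieval.
    wants_rag = needs_citations or docs_change_often
    if wants_rag and needs_new_behavior:
        return "both"
    if wants_rag:
        return "RAG"
    if needs_new_behavior:
        return "fine-tuning"
    # Assumption: if none of the needs apply, neither technique is clearly required.
    return "neither required"


print(suggest_approach(needs_citations=True, docs_change_often=True, needs_new_behavior=False))
print(suggest_approach(needs_citations=False, docs_change_often=False, needs_new_behavior=True))
print(suggest_approach(needs_citations=True, docs_change_often=False, needs_new_behavior=True))
```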


Chunking tips

  • Split on headings / paragraphs, not mid-sentence.
  • Overlap 50–100 tokens so context isn’t lost at boundaries.
  • Store metadata: source, title, url, page.
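A minimal chunker following these tips might look like the sketch below. It makes some simplifying assumptions: tokens are approximated by whitespace-separated words, paragraphs are separated by blank lines, and `chunk_paragraphs` and the sample metadata values are hypothetical names chosen for the example.

```python
def chunk_paragraphs(text, metadata, max_tokens=800, overlap=80):
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    Paragraphs are never split, so a single paragraph longer than max_tokens
    still produces one oversized chunk. Each new chunk starts with the last
    `overlap` words of the previous one, and that tail counts toward its budget.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for words in paragraphs:
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current = current + words
    if current:
        chunks.append(current)
    # Attach the same document-level metadata to every chunk.
    return [{**metadata, "chunk_id": i, "text": " ".join(w)} for i, w in enumerate(chunks)]


# Tiny made-up document: three five-word paragraphs.
doc = "a1 a2 a3 a4 a5\n\nb1 b2 b3 b4 b5\n\nc1 c2 c3 c4 c5"
meta = {"source": "handbook.pdf", "title": "Handbook", "url": None, "page": 1}
pieces = chunk_paragraphs(doc, meta, max_tokens=10, overlap=2)
for piece in pieces:
    print(piece["chunk_id"], piece["text"])
```

With the small limits used here, the second chunk begins with `b4 b5`, the overlap carried from the end of the first chunk, so a sentence that straddles the boundary still appears with some of its context.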

What is an embedding in the LLM context?

For RAG: a sentence embedding model maps the query and each chunk to a vector, and cosine similarity finds the chunks nearest to the query. It is the same intuition as the dot products in Module 1, just in a higher dimension.
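The link between cosine similarity and the dot product can be checked with a few lines of Python. The three-dimensional vectors below are invented for illustration; real embedding vectors typically have hundreds or thousands of dimensions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(u):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


query = [1.0, 2.0, 0.0]
chunk_a = [2.0, 4.0, 0.0]  # same direction as the query, twice as long
chunk_b = [0.0, 0.0, 3.0]  # orthogonal to the query

print(cosine_similarity(query, chunk_a))  # close to 1: same direction
print(cosine_similarity(query, chunk_b))  # 0: nothing in common
# For unit-length vectors, cosine similarity is just the dot product.
print(dot(normalize(query), normalize(chunk_a)))
```

Cosine similarity ignores vector length and compares direction only, which is why many pipelines normalize embeddings once at indexing time and then use a plain dot product for search.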


What's next

Lesson 8 — RAG engineering