Fine-tuning vs RAG
Before we begin
Your app needs company-specific knowledge. Two main paths:
- Fine-tuning: update the model's weights by training on new data.
- RAG (Retrieval-Augmented Generation): fetch relevant documents at query time and paste them into the prompt.
Figure: Two paths to custom knowledge
Figure: RAG pipeline
What you will learn
- Compare fine-tuning and RAG trade-offs.
- Outline chunking, embedding, and vector search.
- Know when citations require RAG.
Fine-tuning
Process: continue training (full or LoRA) on curated examples.
Good for:
- Consistent style, tone, and format
- Specialized vocabulary in a fixed domain
- Teaching new behaviors (tool formats, JSON schemas)
Costs:
- GPU time, MLOps overhead, and retraining whenever the data changes
- Risk of catastrophic forgetting if done poorly
- Hard to cite which document supported an answer
RAG
Process:
- Chunk documents (500–1000 tokens with overlap).
- Embed chunks → store in vector DB (FAISS local, Pinecone hosted, etc.).
- At query time: embed the question → retrieve the top-k most similar chunks.
- Prompt LLM with chunks + user question + citation rules.
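The query-time half of these steps can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory `index` list stands in for a vector DB, and the documents, source names, and prompt wording are all made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, index, k=2):
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Assemble retrieved chunks, citation rules, and the user question."""
    context = "\n".join(
        f"[{i}] ({h['source']}) {h['text']}" for i, h in enumerate(hits, 1)
    )
    return (
        "Answer using only the sources below. Cite them as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical company documents, already chunked.
docs = [
    {"source": "hr-handbook.pdf", "text": "Employees accrue 20 vacation days per year."},
    {"source": "it-wiki", "text": "Reset your VPN password from the self-service portal."},
    {"source": "hr-handbook.pdf", "text": "Unused vacation days expire on March 31."},
]
index = [{**d, "vector": embed(d["text"])} for d in docs]

question = "How many vacation days do employees get?"
hits = retrieve(question, index, k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

The string in `prompt` is what would be sent to the LLM. Swapping `embed` for a real sentence-embedding model and `index` for FAISS or a hosted store changes the quality of retrieval, not the shape of the pipeline.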
Good for:
- PDFs, wikis, blogs that update often
- Citations and audit trails
- Smaller teams without fine-tune infra
Costs:
- Retrieval quality matters — bad chunks → bad answers
- Larger prompts → more tokens / latency
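A quick back-of-the-envelope estimate shows how retrieval settings drive prompt size. The numbers below (top-k of 5, 750-token chunks, and the instruction and question sizes) are illustrative assumptions, not measurements.

```python
def rag_prompt_tokens(k, chunk_tokens, instruction_tokens, question_tokens):
    """Rough prompt size: retrieved chunks plus fixed instructions and the question."""
    return k * chunk_tokens + instruction_tokens + question_tokens


# Hypothetical settings: 5 chunks of 750 tokens, 100 tokens of rules, 30-token question.
with_rag = rag_prompt_tokens(k=5, chunk_tokens=750, instruction_tokens=100, question_tokens=30)
without_rag = rag_prompt_tokens(k=0, chunk_tokens=750, instruction_tokens=100, question_tokens=30)
print(with_rag, without_rag)
```

Under these assumptions the retrieved context dominates the prompt, which is why lowering k or the chunk size is usually the first lever for cost and latency.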
Difference (exam style)
| | Fine-tuning | RAG |
|---|---|---|
| Updates weights | Yes | No (uses base model) |
| Fresh docs tomorrow | Retrain | Re-index |
| Citations | Hard | Natural |
| Teaches new skill | Strong | Weaker |
Many products use both: RAG for facts + light fine-tune for tone.
Chunking tips
- Split on headings / paragraphs, not mid-sentence.
- Overlap 50–100 tokens so context isn’t lost at boundaries.
- Store metadata: source, title, url, page.
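These tips can be combined into a small chunker. The sketch below assumes paragraphs are separated by blank lines and uses whitespace-separated words as a stand-in for tokens; a real pipeline would count with the embedding model's tokenizer. The `Chunk` class and `chunk_paragraphs` function are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    title: str
    url: str
    page: int


def chunk_paragraphs(text, max_tokens=500, overlap_tokens=50, **meta):
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    Paragraphs are never split, so a single paragraph longer than
    max_tokens still becomes one oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(Chunk(" ".join(current), **meta))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_tokens:] if overlap_tokens > 0 else []
        current.extend(words)
    if current:
        chunks.append(Chunk(" ".join(current), **meta))
    return chunks
```

A call such as `chunk_paragraphs(page_text, source="handbook.pdf", title="Handbook", url="https://example.com/handbook", page=3)` attaches the same metadata to every chunk from that page, which is what later makes citations possible.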
What is an embedding in the LLM context?
For RAG, a sentence-embedding model maps the query and each chunk to a vector; cosine similarity then finds the nearest neighbors. It is the same intuition as the dot products in Module 1, just in a higher dimension.
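To make the nearest-neighbor idea concrete, here is a small sketch that scores two chunks against a query. The 3-dimensional vectors and chunk names are invented for illustration; real embedding models emit vectors with hundreds of dimensions or more.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical embeddings for one query and two chunks.
query = [0.9, 0.1, 0.0]
chunks = {
    "vacation policy": [0.8, 0.2, 0.1],
    "vpn setup": [0.0, 0.3, 0.9],
}

scores = {name: cosine_similarity(query, vec) for name, vec in chunks.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The chunk whose vector points in nearly the same direction as the query scores close to 1, while an unrelated chunk scores close to 0. Retrieval keeps the top-k highest scores.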