RAG engineering — chunking, indexing & reranking

Before we begin

Lesson 7 compared fine-tuning vs RAG. This lesson goes deeper on retrieval — where most production RAG quality is won or lost.

Bad retrieval cannot be prompt-engineered away. Fix the index before blaming the LLM.

Figure: Production RAG stack (ingest → chunk → embed → index → retrieve → rerank → generate).

What you will learn

Design chunking strategies for PDFs, wikis, and code.
Build data ingestion pipelines that stay fresh.
Use vector databases and hybrid search.
Apply reranking to improve top-k quality.

Chunking strategies

Strategy	When to use
Fixed token size (512–1024)	General docs; simple baseline
Structure-aware (headings, paragraphs)	Wikis, MDX, HTML — respect boundaries
Overlap (50–100 tokens)	Prevent answers split across chunk edges
Parent–child	Small chunks for search, large parent for LLM context
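
The fixed-size and overlap rows above can be sketched as a sliding window. This is a minimal illustration, not a production chunker: whitespace splitting stands in for a real tokenizer, and the function name `chunk_tokens` and the tiny sizes are assumptions chosen so the result is easy to read.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace-separated words approximate tokens.
words = "a b c d e f g h i j".split()
windows = chunk_tokens(words, size=4, overlap=1)
for w in windows:
    print(w)
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.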

Metadata per chunk: source, title, url, page, updated_at — required for citations and debugging.
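
One way to carry that metadata, and to support the parent–child strategy from the table, is a small record per chunk. The sketch below is an assumption about shape, not a required schema: `chunk_id`, `parent_id`, `format_citation`, and `expand_to_parents` are hypothetical names, and the URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source: str
    title: str
    url: str
    page: Optional[int]
    updated_at: str                  # ISO 8601 date string
    parent_id: Optional[str] = None  # hypothetical link to a larger parent chunk


def format_citation(chunk: Chunk) -> str:
    """Render a human-readable citation from the chunk metadata."""
    page = f", p. {chunk.page}" if chunk.page is not None else ""
    return f"{chunk.title}{page} ({chunk.url}, updated {chunk.updated_at})"


def expand_to_parents(hits: List[Chunk], parents: Dict[str, str]) -> List[str]:
    """Swap retrieved child chunks for their parent texts, first-seen order, no repeats."""
    seen, out = set(), []
    for hit in hits:
        pid = hit.parent_id
        if pid is not None and pid in parents and pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out


child = Chunk(
    chunk_id="handbook-p12-c3",
    text="Requests are reviewed weekly.",
    source="handbook.pdf",
    title="Employee Handbook",
    url="https://example.com/handbook.pdf",  # placeholder URL
    page=12,
    updated_at="2025-06-25",
    parent_id="handbook-p12",
)
parents = {"handbook-p12": "Full text of page 12 goes here."}
print(format_citation(child))
print(expand_to_parents([child, child], parents))
```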

Data ingestion

Production ingestion is a pipeline, not a one-time script:

Watch sources (S3 folder, Notion export, git repo).
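Parse: PDF text extraction, HTML cleanup, and optionally AST parsing for code.
Chunk + embed: batch embedding calls to reduce cost.
Upsert into the index: when a document is removed from the source, delete its stale chunk IDs.
Version: tag each index build (for example, v2025-06-25) so you can roll back.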
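
The upsert and stale-delete steps can be planned by comparing content hashes. This is a sketch under assumptions: `plan_sync` and the idea of storing one hash per document at ingest time are illustrative, and a real pipeline would hand the two lists to whatever vector store it uses.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(current_docs: dict, indexed_hashes: dict):
    """Decide what to re-embed and what to delete.

    current_docs:   doc_id -> text as it exists in the source right now
    indexed_hashes: doc_id -> hash recorded when the doc was last indexed
    """
    to_upsert = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))  # removed at source
    return to_upsert, to_delete


indexed = {
    "a.md": content_hash("alpha"),
    "b.md": content_hash("beta"),
    "c.md": content_hash("gamma"),
}
current = {"a.md": "alpha", "b.md": "beta v2", "d.md": "delta"}
to_upsert, to_delete = plan_sync(current, indexed)
print(to_upsert)  # ['b.md', 'd.md']
print(to_delete)  # ['c.md']
```

Unchanged documents are skipped entirely, which keeps embedding cost proportional to what actually changed.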

Failure modes: scanned PDFs with no OCR, tables rendered as garbage, duplicate pages bloating retrieval.
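
Two of these failure modes can be caught with a cheap gate before chunking. The sketch below flags near-empty pages (a hint that OCR is missing) and exact duplicates after whitespace and case normalization. The function names and the `min_chars` cutoff are assumptions; garbled tables need a different check and are not covered here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences do not hide duplicates."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_pages(pages, min_chars=20):
    """Return (kept_indices, skipped) where skipped holds (index, reason) pairs."""
    seen = set()
    kept, skipped = [], []
    for i, text in enumerate(pages):
        norm = normalize(text)
        if len(norm) < min_chars:
            skipped.append((i, "too_short"))  # likely a scanned page with no OCR text
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            skipped.append((i, "duplicate"))
            continue
        seen.add(digest)
        kept.append(i)
    return kept, skipped


# Made-up sample pages for illustration.
pages = [
    "Refund policy: items may be returned within 30 days.",
    "",
    "Refund  policy: items may be\nreturned within 30 days.",
    "Shipping takes three to five business days.",
]
kept, skipped = filter_pages(pages)
print(kept)     # [0, 3]
print(skipped)  # [(1, 'too_short'), (2, 'duplicate')]
```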

Indexing & vector databases

Option	Fit
FAISS / local	Prototypes, single-server apps
Pinecone, Weaviate, Qdrant	Managed scale, metadata filters
pgvector	Already on Postgres; good for small teams

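Hybrid search combines dense retrieval (embeddings, which match on meaning) with sparse retrieval (BM25, which matches keywords like classic web search). It is critical for SKU codes, legal citations, and exact product names, where exact tokens matter more than semantic similarity.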
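
One common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF). The lesson does not prescribe a fusion method, so treat this as one option among several (weighted score blending is another). The chunk IDs are made up, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each list adds 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c1", "c2", "c3"]   # hypothetical embedding-search results, best first
sparse = ["c4", "c1", "c2"]  # hypothetical BM25 results, best first
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['c1', 'c2', 'c4', 'c3']
```

Here `c4` appears only in the keyword list (think of an exact SKU match) yet still outranks `c3`, while chunks found by both retrievers rise to the top.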

Reranking

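First-stage retrieval returns roughly the top 20 chunks by embedding similarity. A cross-encoder reranker then scores each (query, chunk) pair jointly, which is more accurate than comparing precomputed embeddings, and only the best few (typically 3 to 5) go to the LLM.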

Stage	Speed	Quality
Bi-encoder retrieve	Fast	Good recall
Cross-encoder rerank	Slower	Better precision

A common configuration: retrieve top_k=20, rerank, then pass top_n=5 to the prompt.
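
The two-stage flow can be sketched with pluggable scoring functions. Both scorers below are toy stand-ins, not real models: `overlap_score` plays the role of the bi-encoder and `phrase_score` plays the role of the cross-encoder. All names and sample chunks are assumptions for illustration.

```python
def retrieve_then_rerank(query, candidates, first_stage_score, rerank_score,
                         top_k=20, top_n=5):
    """Cheap scoring over everything, expensive scoring over the shortlist only."""
    shortlist = sorted(
        candidates, key=lambda c: first_stage_score(query, c), reverse=True
    )[:top_k]
    scored = [(rerank_score(query, c), c) for c in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def overlap_score(query, chunk):
    """Toy first stage: number of query words present in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def phrase_score(query, chunk):
    """Toy reranker: word overlap plus a bonus when the exact phrase appears."""
    bonus = 2.0 if query.lower() in chunk.lower() else 0.0
    return overlap_score(query, chunk) + bonus


chunks = [
    "reset your password from the account settings page",
    "password rules: at least twelve characters",
    "to reset password tokens contact support",
    "shipping times vary by region",
]
results = retrieve_then_rerank(
    "reset password", chunks, overlap_score, phrase_score, top_k=3, top_n=2
)
for score, chunk in results:
    print(score, chunk)
```

In a real system the first stage is a vector index lookup and the second is a model call per pair, which is why the shortlist is kept small.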

Evaluation hooks

Before launch, log for each query:

Retrieved chunk IDs
Rerank scores
Whether the answer cites the correct source

The Module 8 Evals lesson formalizes this; your Module 7 RAG project should include a small held-out Q&A set.
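
A per-query log record and two summary numbers over a held-out set might look like the sketch below. The field names, the `QueryLog` class, and the sample rows are assumptions made for illustration; the scores are invented placeholders, not measurements.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class QueryLog:
    query: str
    retrieved_ids: List[str]
    rerank_scores: List[float]
    cited_id: Optional[str]  # chunk the answer cited, if any
    expected_id: str         # gold chunk from the held-out Q&A set

    @property
    def retrieval_hit(self) -> bool:
        return self.expected_id in self.retrieved_ids

    @property
    def citation_correct(self) -> bool:
        return self.cited_id == self.expected_id


def summarize(logs):
    """Fraction of queries where the gold chunk was retrieved / correctly cited."""
    n = len(logs)
    if n == 0:
        return {"hit_rate": 0.0, "citation_accuracy": 0.0}
    return {
        "hit_rate": sum(log.retrieval_hit for log in logs) / n,
        "citation_accuracy": sum(log.citation_correct for log in logs) / n,
    }


logs = [
    QueryLog("q1", ["c1", "c2"], [0.9, 0.4], "c1", "c1"),
    QueryLog("q2", ["c5", "c6"], [0.7, 0.6], "c6", "c5"),
    QueryLog("q3", ["c7", "c8"], [0.3, 0.2], None, "c9"),
    QueryLog("q4", ["c3"], [0.8], "c3", "c3"),
]
line = json.dumps(asdict(logs[0]))  # one JSON line per query, easy to grep later
summary = summarize(logs)
print(line)
print(summary)  # {'hit_rate': 0.75, 'citation_accuracy': 0.5}
```

Separating the two numbers shows where a failure sits: q3 is a retrieval miss, while q2 retrieved the right chunk but the answer cited the wrong one.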

AI safety in RAG

Ground answers — instruct model to answer only from provided chunks.
Refuse when the best retrieval score falls below a threshold.
Show citations — user verifies; reduces blind trust.
PII scan on ingest — do not index secrets.
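
The grounding, refusal, and citation rules can be combined in the step that builds the prompt. This is a sketch under assumptions: the `min_score` value of 0.5 is arbitrary and must be tuned for your scorer, and the prompt wording, function name, and sample hits are illustrative only.

```python
REFUSAL = "I could not find this in the indexed documents."


def build_grounded_prompt(question, hits, min_score=0.5, top_n=3):
    """hits: list of (score, chunk_id, text). Returns a prompt, or None to refuse."""
    usable = sorted(
        (h for h in hits if h[0] >= min_score), key=lambda h: h[0], reverse=True
    )[:top_n]
    if not usable:
        return None  # caller should reply with REFUSAL instead of calling the LLM
    context = "\n".join(f"[{chunk_id}] {text}" for _, chunk_id, text in usable)
    return (
        "Answer using only the sources below. Cite source IDs in brackets. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Made-up hits for illustration.
hits = [(0.82, "c1", "Returns are accepted with a receipt."),
        (0.31, "c2", "Store hours vary by location.")]
prompt = build_grounded_prompt("Can I return an item?", hits)
weak = build_grounded_prompt("What is the CEO's salary?", [(0.12, "c9", "Unrelated text.")])
print(prompt)
print(weak if weak is not None else REFUSAL)
```

Refusing before the model is called means a weak retrieval can never be turned into a confident-sounding answer.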

What's next

Lesson 9 — Hallucinations, trust & AI safety