Project: RAG chatbot with citations

Before we begin

Build a career-relevant GenAI app: users ask questions; the system retrieves from your blog MDX and optional PDFs, then answers with cited sources.

Figure: What you are building. Index docs locally → Next.js chat → grounded answers with links.

How this connects to Module 7

Lesson	Where you use it
Tokenization	Chunk size in tokens/words affects retrieval quality
Embeddings	Same model at index time and query time
Vector search	FAISS finds nearest chunks by cosine similarity
Prompting	Grounding prompt instructs the model to answer only from the cited excerpts
Hallucination	"If unknown, say you don't know" + eval table

Folder layout:

text

rag-chatbot/
  ingest.py              # load MDX/PDF → chunks
  build_index.py         # embed + FAISS
  query.py               # retrieve + prompt builder
  data/
    chunks.json
    index.faiss
  app/
    api/rag-chat/route.ts
    rag-lab/page.tsx     # chat UI
  eval/
    questions.json       # 10 hand-written Q&A checks

What you will build

Ingest content/blog/*.mdx (+ optional PDFs).
Chunk, embed, index with FAISS (local) or Pinecone (hosted).
Next.js chat page — message list + input.
API route — retrieve top-k chunks → call LLM with grounding prompt.
Citations — show title/URL for each excerpt used.
Mini eval — 10 hand-written questions with expected doc references.

Estimated time: 5–8 hours.

Before you start

Finish the Module 7 quiz.
Get an API key for an embedding and chat provider, or run Ollama locally.
Install the Python packages: pip install faiss-cpu numpy openai pypdf tiktoken (adjust packages to your provider)

Create folder rag-chatbot/ in your workspace.

Step 1 — Ingest blog MDX

Goal: Turn each post into a document record with stable id, title, url, and full text.

python

# ingest.py
from pathlib import Path
import re
import json

def strip_mdx(raw: str) -> str:
    """Reduce an MDX post to plain text suitable for embedding."""
    body = re.sub(r"^---.*?---\s*", "", raw, flags=re.S)   # drop leading front matter
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", " ", body)       # drop images
    body = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", body)    # keep link text, drop the URL
    body = re.sub(r"^#+\s*", "", body, flags=re.M)          # drop heading markers at line start
    return re.sub(r"\s+", " ", body).strip()

def load_mdx_docs(root="../content/blog"):
    docs = []
    # sorted() keeps document order (and chunk ids) stable across runs
    for path in sorted(Path(root).glob("*.mdx")):
        raw = path.read_text(encoding="utf-8")
        body = strip_mdx(raw)
        # Title: first level-1 heading, falling back to the file name
        title_match = re.search(r"^#\s+(.+)$", raw, re.M)
        title = title_match.group(1).strip() if title_match else path.stem
        docs.append({
            "id": path.stem,
            "title": title,
            "url": f"/blog/{path.stem}",
            "text": body,
        })
    return docs

# Optional PDF (pypdf):
# from pypdf import PdfReader
# text = "".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

Why strip links/images? URLs and markdown syntax add noise; retrieval should match semantic content.
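
Step 1 takes the title from the first `# ` heading and falls back to the file name. Many MDX blogs keep the title in YAML front matter instead; if yours does, a fallback along these lines may help. This is a minimal sketch under assumptions: the helper name `title_from_front_matter` is hypothetical, it only handles a single-line `title:` field, and a real YAML parser would be more robust.

```python
import re
from typing import Optional

def title_from_front_matter(raw: str) -> Optional[str]:
    """Return the front-matter `title:` value, or None if absent (hypothetical helper)."""
    # Assumes front matter is a leading block delimited by --- lines.
    block = re.match(r"^---\s*\n(.*?)\n---", raw, flags=re.S)
    if not block:
        return None
    # Assumes a simple single-line field such as: title: "My post"
    field = re.search(r"^title:\s*(.+)$", block.group(1), flags=re.M)
    if not field:
        return None
    return field.group(1).strip().strip("\"'")

sample = '---\ntitle: "Hello RAG"\ndraft: false\n---\n\nBody text.'
print(title_from_front_matter(sample))                   # Hello RAG
print(title_from_front_matter("No front matter here"))   # None
```

In load_mdx_docs you could try this helper first and only then fall back to the heading search and path.stem.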

Step 2 — Chunk with overlap

Goal: Split long posts into ~800-word windows with overlap so sentences at boundaries aren't lost.

python

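def chunk_doc(doc, size=800, overlap=120):
    """Split a document into overlapping windows of `size` words."""
    words = doc["text"].split()
    step = max(1, size - overlap)  # guards against overlap >= size
    chunks = []
    i = 0
    while i < len(words):
        piece = " ".join(words[i : i + size])
        chunks.append({
            "source_id": doc["id"],
            "title": doc["title"],
            "url": doc["url"],
            "chunk_index": len(chunks),
            "text": piece,
        })
        if i + size >= len(words):
            break  # this window reached the end; skip a tail chunk that only repeats it
        i += step
    return chunks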
 
def build_all_chunks(docs):
    out = []
    for d in docs:
        out.extend(chunk_doc(d))
    return out

Parameter	Tradeoff
`size=800`	Larger → more context per hit; smaller → more precise retrieval
`overlap=120`	Prevents cutting facts across chunk borders
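
To make the tradeoff concrete, the window arithmetic can be worked through on its own. The sketch below is illustrative only: `window_bounds` is a hypothetical helper that reports the word offsets each window would cover, using the same `size` and `overlap` defaults as Step 2.

```python
def window_bounds(n_words: int, size: int = 800, overlap: int = 120) -> list:
    """Return (start, end) word offsets for each sliding window."""
    step = max(1, size - overlap)  # each window starts `step` words after the previous one
    bounds = []
    start = 0
    while start < n_words:
        end = min(start + size, n_words)
        bounds.append((start, end))
        if end == n_words:
            break  # stop once a window reaches the end of the text
        start += step
    return bounds

bounds = window_bounds(2000)
print(bounds)                        # [(0, 800), (680, 1480), (1360, 2000)]
print(bounds[0][1] - bounds[1][0])   # 120 words shared by neighbouring windows
```

A 2,000-word post therefore becomes three chunks, and any sentence shorter than 120 words that straddles a boundary appears whole in at least one of them.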

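Save: create data/ if it does not exist, then write the chunks inside a with block so the file is flushed and closed: with open("data/chunks.json", "w", encoding="utf-8") as f: json.dump(chunks, f, indent=2)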

Step 3 — Embed and index (FAISS)

Goal: Convert each chunk to a vector; build an index for fast similarity search.

python

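# build_index.py
import faiss
import numpy as np
import json
from openai import OpenAI  # or Ollama / sentence-transformers

client = OpenAI()  # reads OPENAI_API_KEY from the environment
EMBED_MODEL = "text-embedding-3-small"

def embed_texts(texts: list[str], batch_size: int = 100) -> np.ndarray:
    """Embed texts in batches so a large corpus does not exceed per-request limits."""
    vecs = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        resp = client.embeddings.create(model=EMBED_MODEL, input=batch)
        vecs.extend(d.embedding for d in resp.data)
    return np.array(vecs, dtype="float32")

# The guard lets other scripts import embed_texts without rebuilding the index
if __name__ == "__main__":
    with open("data/chunks.json", encoding="utf-8") as f:
        chunks = json.load(f)
    if not chunks:
        raise SystemExit("no chunks found; check the content path and re-run ingest.py")
    vectors = embed_texts([c["text"] for c in chunks])

    # Cosine similarity = dot product after L2 normalization
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
    index.add(vectors)
    faiss.write_index(index, "data/index.faiss")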

Critical: Use the same embedding model for indexing and queries. Changing models requires rebuilding the index.
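
One way to catch a model mismatch early is to record which model built the index and check it before querying. The sketch below is an assumption, not part of the original project: the file name `index_meta.json` and both helper functions are hypothetical, and the dimension shown is just what you would read from `vectors.shape[1]` at build time.

```python
import json
import tempfile
from pathlib import Path

def write_index_meta(path: Path, model: str, dim: int, n_chunks: int) -> None:
    """Save the embedding model name and vector size next to the index."""
    meta = {"model": model, "dim": dim, "n_chunks": n_chunks}
    path.write_text(json.dumps(meta), encoding="utf-8")

def check_index_meta(path: Path, model: str) -> dict:
    """Raise if the query-time model differs from the one used to build the index."""
    meta = json.loads(path.read_text(encoding="utf-8"))
    if meta["model"] != model:
        raise ValueError(
            f"index was built with {meta['model']!r} but queries use {model!r}; rebuild the index"
        )
    return meta

# Demo in a temporary directory (a real project might use data/index_meta.json)
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "index_meta.json"
    write_index_meta(meta_path, "text-embedding-3-small", dim=1536, n_chunks=42)
    demo_meta = check_index_meta(meta_path, "text-embedding-3-small")
    try:
        check_index_meta(meta_path, "some-other-model")
        mismatch_caught = False
    except ValueError as exc:
        mismatch_caught = True
        print("mismatch detected:", exc)

print(demo_meta["dim"])  # 1536
```

Calling the write helper at the end of build_index.py and the check helper at the top of query.py would turn a silent quality drop into an explicit error.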

Step 4 — Retrieve at query time

python

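# query.py
import faiss
import numpy as np
import json
from pathlib import Path
from openai import OpenAI

# Resolve data files relative to this script so it works from any working directory
DATA = Path(__file__).parent / "data"
client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # must match the model used in build_index.py

index = faiss.read_index(str(DATA / "index.faiss"))
chunks = json.loads((DATA / "chunks.json").read_text(encoding="utf-8"))

def embed_texts(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

def retrieve(query: str, k=4):
    q = embed_texts([query])
    faiss.normalize_L2(q)  # same normalization as at index time
    scores, ids = index.search(q, k)
    hits = []
    for score, idx in zip(scores[0], ids[0]):
        if idx < 0:  # FAISS pads with -1 when the index holds fewer than k vectors
            continue
        hits.append({**chunks[idx], "score": float(score)})
    return hits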

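Log the scores during eval. If every score is low (below roughly 0.3, though the right cutoff depends on the embedding model), treat retrieval as failed and say the docs don't cover the question rather than letting the LLM guess.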
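
A guard for low scores might look like the sketch below. The names `confident_hits` and `answer_or_fallback` are hypothetical, the 0.3 default is only the starting point mentioned above and should be tuned per embedding model, and `generate` stands in for whatever function calls the LLM.

```python
FALLBACK = "I don't have that in the indexed docs."

def confident_hits(hits: list, min_score: float = 0.3) -> list:
    """Keep hits scoring at least min_score; an empty result means retrieval failed."""
    return [h for h in hits if h["score"] >= min_score]

def answer_or_fallback(hits: list, generate, min_score: float = 0.3) -> dict:
    """Call generate(hits) only when some hit clears the threshold."""
    good = confident_hits(hits, min_score)
    if not good:
        return {"answer": FALLBACK, "citations": []}  # skip the LLM call entirely
    return generate(good)

# Demo with a stand-in generator instead of a real LLM call
def fake_generate(hits: list) -> dict:
    return {"answer": f"grounded answer from {len(hits)} excerpt(s)", "citations": hits}

strong = [{"title": "A", "score": 0.62}, {"title": "B", "score": 0.21}]
weak = [{"title": "C", "score": 0.12}]
print(answer_or_fallback(strong, fake_generate)["answer"])  # grounded answer from 1 excerpt(s)
print(answer_or_fallback(weak, fake_generate)["answer"])    # I don't have that in the indexed docs.
```

Returning the fallback without calling the model also saves a request when the question is clearly off-topic.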

Step 5 — Grounded prompt

python

# Part of query.py; needs `from openai import OpenAI` and `client = OpenAI()` at the top of the file.
def build_prompt(question: str, hits: list[dict], max_chars: int = 1200) -> str:
    """Number each excerpt so the model can cite it as [1], [2], ..."""
    context = ""
    for i, h in enumerate(hits, 1):
        # max_chars caps prompt size; an 800-word chunk is longer, so raise it if answers miss details
        context += f"[{i}] Title: {h['title']}\nURL: {h['url']}\n{h['text'][:max_chars]}\n\n"
    return f"""You are a helpful assistant. Use ONLY the excerpts below.
Cite sources inline like [1] or [2].
If the answer is not in the excerpts, say "I don't have that in the indexed docs."

{context}

Question: {question}
Answer:"""

def answer(question: str):
    hits = retrieve(question, k=4)
    prompt = build_prompt(question, hits)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    text = resp.choices[0].message.content
    # One citation per retrieved chunk, in the same order as the [n] markers in the prompt
    citations = [{"title": h["title"], "url": h["url"], "excerpt": h["text"][:200]} for h in hits]
    return {"answer": text, "citations": citations}

Temperature 0–0.3 — lower randomness for factual Q&A.
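
The answer() function returns a citation for every retrieved chunk, whether or not the model referred to it. If you prefer to show only the excerpts the answer actually cites, one possible approach is to parse the [n] markers out of the answer text. This is a sketch with hypothetical helper names (`cited_numbers`, `citations_used`), and it assumes the model follows the `[1]` style requested in the prompt.

```python
import re

def cited_numbers(answer_text: str) -> list:
    """Return the distinct [n] markers in the answer, in order of first appearance."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer_text):
        n = int(match.group(1))
        if n not in seen:
            seen.append(n)
    return seen

def citations_used(answer_text: str, hits: list) -> list:
    """Map each cited marker back to its hit; markers with no matching excerpt are ignored."""
    out = []
    for n in cited_numbers(answer_text):
        if 1 <= n <= len(hits):
            h = hits[n - 1]  # prompt numbering starts at 1
            out.append({"n": n, "title": h["title"], "url": h["url"]})
    return out

hits = [
    {"title": "Post A", "url": "/blog/post-a"},
    {"title": "Post B", "url": "/blog/post-b"},
    {"title": "Post C", "url": "/blog/post-c"},
]
text = "Chunks overlap [2]. See also [1] and again [2]. A stray marker [7]."
print(cited_numbers(text))                            # [2, 1, 7]
print([c["url"] for c in citations_used(text, hits)]) # ['/blog/post-b', '/blog/post-a']
```

An empty result is itself a useful signal: the model either answered without citing or returned the "I don't have that" fallback.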

Step 6 — Next.js API route

typescript

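// app/api/rag-chat/route.ts
import { NextResponse } from "next/server";
import { spawn } from "child_process";

export async function POST(req: Request) {
  // req.json() throws on a malformed body, so treat that as a missing message
  const body = await req.json().catch(() => null);
  const message = typeof body?.message === "string" ? body.message.trim() : "";
  if (!message) {
    return NextResponse.json({ error: "message required" }, { status: 400 });
  }

  try {
    // Option A: call Python script that prints JSON to stdout
    const result = await runPythonQuery(message);
    return NextResponse.json(result);
  } catch (err) {
    console.error(err);
    return NextResponse.json({ error: "query failed" }, { status: 500 });
  }
}

function runPythonQuery(message: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    // The message is passed as an argv entry (no shell), which avoids command injection
    const proc = spawn("python", ["rag-chatbot/query.py", message]);
    let out = "";
    let err = "";
    proc.stdout.on("data", (d) => (out += d));
    proc.stderr.on("data", (d) => (err += d));
    proc.on("error", reject); // e.g. python is not on PATH
    proc.on("close", (code) => {
      if (code !== 0) return reject(new Error(`query failed: ${err}`));
      try {
        resolve(JSON.parse(out));
      } catch (e) {
        reject(e); // stdout was not valid JSON
      }
    });
  });
}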

Option B: Port embed + FAISS to TypeScript with @xenova/transformers for small offline demos (slower, but no Python subprocess).
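
Option A only works if query.py accepts the message as its first command-line argument and prints exactly one JSON object to stdout. The steps above define retrieve() and answer() but no entry point, so something like the following sketch is needed. The function name `run_cli` is hypothetical, and `fake_answer` merely stands in for the real answer() in the demo.

```python
import io
import json
import sys

def run_cli(argv: list, answer_fn, out=None) -> int:
    """Print answer_fn(message) as one JSON object and return a process exit code."""
    out = out if out is not None else sys.stdout
    if len(argv) < 2 or not argv[1].strip():
        # Diagnostics go to stderr so stdout stays parseable JSON
        print("usage: query.py <message>", file=sys.stderr)
        return 2
    json.dump(answer_fn(argv[1]), out)
    return 0

# Demo with a stand-in for answer()
def fake_answer(question: str) -> dict:
    return {"answer": f"echo: {question}", "citations": []}

buf = io.StringIO()
code = run_cli(["query.py", "hello"], fake_answer, out=buf)
print(code, buf.getvalue())  # 0 {"answer": "echo: hello", "citations": []}
```

At the bottom of query.py this would be wired up as `if __name__ == "__main__": sys.exit(run_cli(sys.argv, answer))`. Any debug print() left in the script would corrupt the JSON that the route parses, so send logging to stderr.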

Step 7 — Chat UI

Client page (app/rag-lab/page.tsx):

Scrollable messages — user bubbles right, assistant left.
Assistant message renders citation chips linking to /blog/....
Loading state: "Retrieving…" then "Generating…".
Optional: show retrieved chunk titles in a sidebar for debugging.

tsx

// After fetch("/api/rag-chat") resolves and the JSON body is parsed into `data`:
// setMessages((prev) => [...prev, { role: "assistant", text: data.answer, citations: data.citations }]);

Make it meaningful: ask questions only answerable from your blog — verify citation links open the correct post.

Step 8 — Evaluation table

Create eval/questions.json:

json

[
  {
    "question": "How does on-device AI differ from cloud?",
    "expected_slug": "on-device-ai-vs-cloud-ai"
  }
]

Question	Expected source post	Correct cite?
How does on-device AI differ from cloud?	on-device-ai-vs-cloud-ai	…
… add 9 more …

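Run a script that calls answer(q) for each question and checks whether expected_slug appears in the citation URLs. Because answer() returns every retrieved chunk as a citation, this check measures retrieval rather than what the model actually cited. Target ≥8/10 before calling the project done.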
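
The eval script could be as small as the sketch below. The `grade` function is hypothetical, it assumes citation URLs end in the post slug (as in `/blog/<slug>` from Step 1), and the demo uses a stand-in for answer() so it runs without an index or API key.

```python
def grade(questions: list, answer_fn) -> list:
    """Return one row per question with a pass/fail flag for the expected source."""
    rows = []
    for q in questions:
        result = answer_fn(q["question"])
        urls = [c["url"] for c in result["citations"]]
        # Compare the last path segment exactly, so "post" does not match "post-2"
        ok = any(u.rstrip("/").split("/")[-1] == q["expected_slug"] for u in urls)
        rows.append({
            "question": q["question"],
            "expected_slug": q["expected_slug"],
            "correct": ok,
        })
    return rows

# Demo with a stand-in for answer() that always cites the same post
def fake_answer(question: str) -> dict:
    return {"answer": "...", "citations": [{"url": "/blog/post-a"}]}

questions = [
    {"question": "Q1", "expected_slug": "post-a"},
    {"question": "Q2", "expected_slug": "post-b"},
]
rows = grade(questions, fake_answer)
passed = sum(r["correct"] for r in rows)
print(f"{passed}/{len(rows)} correct")  # 1/2 correct
```

In the real script, load eval/questions.json with json.load and pass answer as answer_fn; the rows map directly onto the table above.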

Stretch goals

Swap FAISS → Pinecone for hosted index.
Add hybrid search (BM25 + vectors).
Stream tokens with Server-Sent Events.

Troubleshooting

Symptom	Fix
Answers ignore docs	Prompt too weak; lower temperature; show chunks in UI
Wrong post cited	Smaller chunks, higher k, or hybrid BM25
Empty index	Fix path to `content/blog`; re-run `ingest.py`, then `build_index.py`
Slow first query	Batch embed at build time; cache query embeddings; with Option A, each request also pays for Python startup and index loading
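
For the "cache query embeddings" fix, the standard library's functools.lru_cache is one simple option. The sketch below uses a hypothetical stand-in, `embed_query_uncached`, in place of a real embedding call, and returns a tuple because cached values are shared between callers and should not be mutated.

```python
from functools import lru_cache

calls = 0

def embed_query_uncached(text: str) -> tuple:
    """Stand-in for a real embedding request (hypothetical); counts how often it runs."""
    global calls
    calls += 1
    return (float(len(text)), 0.0)

@lru_cache(maxsize=256)
def embed_query(text: str) -> tuple:
    """Return the cached embedding for a repeated query string."""
    return embed_query_uncached(text)

embed_query("what is RAG?")
embed_query("what is RAG?")   # served from the cache
embed_query("what is FAISS?")
print(calls)                  # 2
```

An in-process cache like this only helps while the Python process stays alive. With the subprocess-per-request approach of Option A it is reset on every call, so it pays off mainly if you move retrieval into a long-running service.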

Deliverables

Indexed corpus from blog (+ PDF optional)
Working chat UI + API
Answers include numbered citations
Eval table with ≥8/10 correct grounding

What's next

Module 7 complete. Continue to Module 8 — Agentic AI when ready.

Return to the AI course curriculum anytime to track progress.