LLM training lifecycle
Before we begin
A production LLM is not trained once and shipped. It moves through a lifecycle: raw data → pre-training → post-training → inference → monitoring.
Pre-training teaches language. Post-training teaches behavior. Inference is what users touch.
What you will learn
- Map the end-to-end lifecycle from data to deployment.
- Explain pre-training, SFT, and RLHF at a high level.
- Know what happens at inference vs training time.
- See where RAG and fine-tuning fit later in Module 7.
Before this lesson
Stage 1 — Data processing
Before any gradient step:
| Step | Purpose |
|---|---|
| Collect | Web, books, code, licensed corpora |
| Filter | Remove spam, PII, toxic or low-quality text |
| Deduplicate | Same paragraph repeated millions of times biases the model |
| Normalize | Consistent encoding, strip broken HTML |
| Shard | Split into training chunks for distributed jobs |
Bad data at this stage cannot be fixed by a bigger model.
Stage 2 — Pre-training
Goal: predict the next token on massive unlabeled text.
- Train a decoder-only transformer (GPT-style) or other architecture.
- Runs for weeks on thousands of GPUs.
- Output: base model — strong at language, weak at following instructions.
What it learns: grammar, facts (noisy), coding patterns, reasoning traces seen in data.
What it lacks: reliable obedience to “answer in JSON” or “refuse harmful requests” — that comes next.
Stage 3 — Post-training
Turns a base model into a helpful assistant.
Supervised fine-tuning (SFT)
- Curated (prompt, ideal response) pairs written by humans or teachers.
- Teaches formats: chat roles, tool JSON, concise answers.
Preference tuning / RLHF
- Compare two answers; train model to prefer the better one (human or AI judge).
- Reduces toxic or unhelpful outputs; aligns tone with product goals.
Variants you will hear: DPO, ORPO — same family, different math; awareness is enough for engineering interviews.
Stage 4 — Inference
What ships to apps:
- User sends messages via API.
- Model runs forward pass only (no weight updates).
- Tokens stream out until stop condition.
Not retrained on each user message unless you add fine-tuning or RAG (later lessons).
Optional inference optimizations: quantization (Lesson 5), KV-cache, speculative decoding.
Stage 5 — Operations (after launch)
| Activity | Why |
|---|---|
| Monitor | Latency, errors, cost per request |
| Eval | Regression tests when prompts or models change |
| Refresh data | RAG indexes, fine-tune sets |
| Version models | Pin gpt-4.1 vs gpt-4.1-mini per route |
Module 10 (production) goes deep here; Module 8 covers evals for agents.
Lifecycle diagram (exam style)
Raw data → clean/shard → PRE-TRAIN (next-token)
→ SFT (examples) → preference tune (RLHF/DPO)
→ deploy INFERENCE → monitor + eval + optional RAG/fine-tune