LLM training lifecycle

Before we begin

A production LLM is not trained once and shipped. It moves through a lifecycle: raw data → pre-training → post-training → inference → monitoring.

Pre-training teaches language. Post-training teaches behavior. Inference is what users touch.

What you will learn

Map the end-to-end lifecycle from data to deployment.
Explain pre-training, SFT, and RLHF at a high level.
Know what happens at inference vs training time.
See where RAG and fine-tuning fit later in Module 7.

Before this lesson

Stage 1 — Data processing

Before any gradient step:

Step	Purpose
Collect	Web, books, code, licensed corpora
Filter	Remove spam, PII, toxic or low-quality text
Deduplicate	Same paragraph repeated millions of times biases the model
Normalize	Consistent encoding, strip broken HTML
Shard	Split into training chunks for distributed jobs

Bad data at this stage cannot be fixed by a bigger model.

Stage 2 — Pre-training

Goal: predict the next token on massive unlabeled text.

Train a decoder-only transformer (GPT-style) or other architecture.
Runs for weeks on thousands of GPUs.
Output: base model — strong at language, weak at following instructions.

What it learns: grammar, facts (noisy), coding patterns, reasoning traces seen in data.

What it lacks: reliable obedience to “answer in JSON” or “refuse harmful requests” — that comes next.

Stage 3 — Post-training

Turns a base model into a helpful assistant.

Supervised fine-tuning (SFT)

Curated (prompt, ideal response) pairs written by humans or teachers.
Teaches formats: chat roles, tool JSON, concise answers.

Preference tuning / RLHF

Compare two answers; train model to prefer the better one (human or AI judge).
Reduces toxic or unhelpful outputs; aligns tone with product goals.

Variants you will hear: DPO, ORPO — same family, different math; awareness is enough for engineering interviews.

Stage 4 — Inference

What ships to apps:

User sends messages via API.
Model runs forward pass only (no weight updates).
Tokens stream out until stop condition.

Not retrained on each user message unless you add fine-tuning or RAG (later lessons).

Optional inference optimizations: quantization (Lesson 5), KV-cache, speculative decoding.

Stage 5 — Operations (after launch)

Activity	Why
Monitor	Latency, errors, cost per request
Eval	Regression tests when prompts or models change
Refresh data	RAG indexes, fine-tune sets
Version models	Pin `gpt-4.1` vs `gpt-4.1-mini` per route

Module 10 (production) goes deep here; Module 8 covers evals for agents.

Lifecycle diagram (exam style)

plaintext

Raw data → clean/shard → PRE-TRAIN (next-token)
    → SFT (examples) → preference tune (RLHF/DPO)
    → deploy INFERENCE → monitor + eval + optional RAG/fine-tune

What's next

Lesson 3 — Prompt engineering