Fine-tuning & quantization for work

Before we begin

Business teams need models that are fast, affordable, and on-brand. Two levers dominate: fine-tuning (adapt behavior) and quantization (shrink weights for speed).

Fine-tuning changes what the model does. Quantization changes how efficiently it runs.

What you will learn

Choose full fine-tune vs LoRA for business requirements.
Explain INT8 / GPTQ / AWQ quantization trade-offs.
Know when to fine-tune vs use RAG (detailed comparison in Lesson 7).

Fine-tuning for business requirements

Use cases that fit fine-tuning:

Requirement	Example
Tone & format	Always reply in your company’s support voice
Structured output	Emit valid tool JSON every time
Domain vocabulary	Medical billing codes, legal clause templates
Classification head	Route tickets to departments

LoRA (Low-Rank Adaptation): train small adapter matrices instead of all weights.

Cheaper than a full fine-tune: often a single GPU and hours rather than days.
Swappable: the base model stays frozen, so you can load a different LoRA adapter per customer.
Supported by Hugging Face PEFT and many cloud fine-tuning APIs.
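
To make the "small adapter matrices" point concrete, here is a rough parameter count for a single weight matrix. It is a minimal sketch: the 4096 x 4096 shape and rank 8 are illustrative assumptions, not values from any particular model, and `lora_param_counts` is a hypothetical helper name.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full_params, lora_params) for one d_out x d_in weight matrix."""
    full = d_out * d_in
    # LoRA trains two small matrices, B (d_out x rank) and A (rank x d_in),
    # while the original weight matrix stays frozen.
    lora = rank * (d_out + d_in)
    return full, lora


# Assumed example shape and rank, chosen only for illustration.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 16,777,216  lora: 65,536  ratio: 0.39%
```

Under these assumptions the adapter trains well under 1% of the parameters in that matrix, which is why the training run is cheaper and the resulting adapter file is small enough to swap per customer.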

Full fine-tune: update all weights. Use it when LoRA quality plateaus or the task needs changes that small adapters cannot capture (rare for most product teams).

Data you need

Quality beats quantity:

500–5,000 excellent (input, output) pairs often beat 50k noisy rows.
Include edge cases your eval set will test.
Hold out 20% for validation — never train on your eval queries.
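
The hold-out rule above can be sketched in a few lines of standard-library Python. This is one possible approach, not a prescribed pipeline: `split_pairs` is a hypothetical helper, and deduplicating on a lowercased, whitespace-normalised input is an assumption about what counts as "the same" example.

```python
import random


def split_pairs(pairs, val_fraction=0.2, seed=0):
    """Deduplicate (input, output) pairs by input, shuffle, and hold out a slice."""
    unique = {}
    for inp, out in pairs:
        key = " ".join(inp.lower().split())  # normalise case and whitespace
        unique.setdefault(key, (inp, out))   # keep the first copy of each input
    items = list(unique.values())
    random.Random(seed).shuffle(items)       # fixed seed for a repeatable split
    n_val = max(1, int(len(items) * val_fraction))
    return items[n_val:], items[:n_val]      # (train, validation)


# Made-up example data: ten distinct pairs plus one near-duplicate input.
pairs = [(f"Question {i}", f"Answer {i}") for i in range(10)]
pairs.append(("question 0 ", "Answer 0 again"))

train, val = split_pairs(pairs)
print(len(train), len(val))  # 8 2
```

Deduplicating before the split matters: if the same query appears twice, one copy could land in training and the other in validation, which inflates the validation score.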

Quantization

Model weights are typically stored in 16-bit floating point (FP16 or BF16). Quantization stores them in fewer bits (INT8, INT4), which cuts VRAM use and can speed up inference.
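
As a back-of-the-envelope check on the VRAM saving, weight memory is roughly parameters x bits / 8 bytes. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only; it ignores the KV cache, activations, and the per-group scale metadata that quantized formats add.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits / 8 / 1e9


# Assumed model size, for illustration only.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB
```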

Method	Idea	Trade-off
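INT8	8-bit weights	Small quality loss, about half the memory of FP16; speedup depends on kernel support
GPTQ / AWQ	4-bit with calibration	Fits large models on consumer GPUs; needs calibration data and loses more quality than INT8
Dynamic quant	Quantize activations at runtime	Easier (no calibration step), less optimal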
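
To show the basic idea behind the table, here is a minimal sketch of symmetric per-tensor INT8 quantization (sometimes called absmax). It is deliberately the simplest scheme and is not GPTQ or AWQ, which choose their 4-bit values using calibration data; the function names and the sample weights are made up for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back if all zeros
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 0.3]      # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)                              # [50, -127, 0, 30]
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is at most half the scale. A single outlier weight inflates the scale for the whole tensor, which is one reason practical methods quantize in smaller groups.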

When it matters: self-hosting Llama on one GPU, edge deployment, high QPS API where cost dominates.

When to skip: you use hosted APIs such as OpenAI or Anthropic. The provider controls how the model is served, so quantization is not a setting you manage.

Practical workflow

Start with prompting + RAG (cheapest iteration).
If format or tone still drifts → LoRA fine-tune on curated examples.
If latency or cost hurts → quantize self-hosted model or route to smaller model.
Eval after every change (Module 8, Lesson 7).
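
One way to make "eval after every change" cheap is a small scripted check that you rerun on a fixed set of model outputs. The sketch below measures how often outputs are valid tool JSON; the required keys `tool` and `arguments` are assumptions for illustration, so substitute whatever schema your tools actually expect.

```python
import json


def json_validity_rate(outputs, required_keys=("tool", "arguments")):
    """Fraction of output strings that parse as a JSON object with the required keys."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and all(k in obj for k in required_keys):
            ok += 1
    return ok / len(outputs)


# Made-up model outputs: one valid, one not JSON, one missing a key.
sample = [
    '{"tool": "lookup_order", "arguments": {"id": 42}}',
    "Sure! Here is the JSON you asked for.",
    '{"tool": "lookup_order"}',
]
print(f"{json_validity_rate(sample):.2f}")  # 0.33
```

Running the same check before and after a prompt change, a LoRA fine-tune, or a quantization step shows whether the change helped or quietly broke the output format.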

What's next

Lesson 6 — Building AI-automated workflows