Fine-tuning & quantization for work
Before we begin
Business teams need models that are fast, affordable, and on-brand. Two levers dominate: fine-tuning (adapt behavior) and quantization (shrink weights for speed).
Fine-tuning changes what the model does. Quantization changes how efficiently it runs.
What you will learn
- Choose full fine-tune vs LoRA for business requirements.
- Explain INT8 / GPTQ / AWQ quantization trade-offs.
- Know when to fine-tune vs use RAG (detailed comparison in Lesson 7).
Fine-tuning for business requirements
Use cases that fit fine-tuning:
| Requirement | Example |
|---|---|
| Tone & format | Always reply in your company’s support voice |
| Structured output | Emit valid tool JSON every time |
| Domain vocabulary | Medical billing codes, legal clause templates |
| Classification head | Route tickets to departments |
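For the structured-output row, it helps to measure how often the current model already emits valid tool JSON before deciding to fine-tune. The sketch below is one minimal way to do that. The schema (`tool` and `arguments` keys) and the sample outputs are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical tool-call schema: these key names are an assumption.
REQUIRED_KEYS = {"tool", "arguments"}


def is_valid_tool_call(text: str) -> bool:
    """Return True if text parses as a JSON object containing the required keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj)


def valid_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that pass the format check."""
    if not outputs:
        return 0.0
    return sum(is_valid_tool_call(o) for o in outputs) / len(outputs)


# Made-up model outputs for illustration.
samples = [
    '{"tool": "route_ticket", "arguments": {"department": "billing"}}',
    'Sure! Here is the JSON: {"tool": "route_ticket"}',  # chatty prefix breaks parsing
    '{"tool": "route_ticket"}',                          # missing "arguments"
]
print(f"valid rate: {valid_rate(samples):.0%}")
```

A rate that stays well below 100% after prompt changes is a signal that fine-tuning on correctly formatted examples may be worth trying.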
LoRA (Low-Rank Adaptation): train small adapter matrices instead of all weights.
- Cheaper than a full fine-tune: often one GPU and hours rather than days.
- Swappable: the base model stays frozen, so you can swap LoRA adapters per customer.
- Supported in Hugging Face PEFT and many cloud fine-tuning APIs.
Full fine-tune: update all weights. Use it when LoRA plateaus or the task needs deep weight changes (rare for most product teams).
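To make the cost difference concrete, here is a small pure-Python sketch of the LoRA idea for a single weight matrix: the frozen weight W is left alone and a low-rank product B·A is added to its output. The function names, the 4096×4096 layer size, and rank 8 are illustrative assumptions, not values from any particular model or library.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_in * d_out           # full fine-tune: every entry of W is trainable
    lora = rank * (d_in + d_out)  # LoRA: A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x). W stays frozen; only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative sizes (assumed): one 4096 x 4096 projection, rank 8.
full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For this assumed layer, the adapter trains 65,536 parameters instead of 16,777,216, which is about 0.39% of the full matrix. In the usual LoRA setup B starts at zero, so the adapted model initially behaves exactly like the base model.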
Data you need
Quality beats quantity:
- 500–5,000 excellent (input, output) pairs often beat 50k noisy rows.
- Include edge cases your eval set will test.
- Hold out 20% for validation — never train on your eval queries.
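The hold-out rule above can be sketched in a few lines. This is a minimal illustration, assuming pairs are plain `(input, output)` tuples and that lowercasing plus whitespace stripping is enough to detect duplicates. Real data usually needs stronger near-duplicate detection. The helper names are hypothetical.

```python
import random


def normalize(text: str) -> str:
    """Crude normalization used to spot duplicates (an assumption, not a standard)."""
    return " ".join(text.lower().split())


def split_pairs(pairs, holdout_frac=0.2, seed=0):
    """Deduplicate by input, shuffle reproducibly, and return (train, validation)."""
    seen, unique = set(), []
    for inp, out in pairs:
        key = normalize(inp)
        if key not in seen:
            seen.add(key)
            unique.append((inp, out))
    random.Random(seed).shuffle(unique)
    n_val = max(1, round(len(unique) * holdout_frac))
    return unique[n_val:], unique[:n_val]


def drop_eval_overlap(train, eval_inputs):
    """Remove training pairs whose input also appears in the eval queries."""
    blocked = {normalize(q) for q in eval_inputs}
    return [(i, o) for i, o in train if normalize(i) not in blocked]


# Toy data for illustration only.
pairs = [(f"question {n}", f"answer {n}") for n in range(10)]
pairs.append(("Question 1", "duplicate answer"))  # same input, different casing
train, val = split_pairs(pairs)
print(len(train), len(val))
```

Deduplicating before the split matters: if the same input lands in both sets, the validation score overstates how well the model generalizes.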
Quantization
Weights are usually FP16/BF16 in training. Quantization stores them in fewer bits (INT8, INT4) for faster inference and less VRAM.
| Method | Idea | Trade-off |
|---|---|---|
| INT8 | 8-bit weights | Small quality loss; about half the memory of FP16, with speedups where the hardware has INT8 kernels |
| GPTQ / AWQ | 4-bit with calibration | Fits large models on consumer GPUs |
| Dynamic quant | Quantize at runtime | Easier, less optimal |
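The core idea behind the table can be shown with the simplest scheme, symmetric absmax quantization of one tensor to INT8. This is a teaching sketch in pure Python, not GPTQ or AWQ. Those methods additionally use calibration data to decide how to round and scale. The weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one scale for the whole tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


# Made-up weights for illustration.
weights = [0.02, -0.51, 0.33, 1.27, -1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is at most half the scale. A single outlier weight inflates the scale for everything else, which is one reason practical methods quantize per channel or per group instead of per tensor.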
When it matters: self-hosting Llama on one GPU, edge deployment, high QPS API where cost dominates.
When to skip: you use OpenAI/Anthropic hosted APIs. The provider controls how the model is served, so weight precision is not something you can tune.
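For the self-hosting case, a back-of-the-envelope estimate shows why bit width decides whether a model fits on one GPU. The sketch below counts weight storage only. It ignores the KV cache, activations, and the small overhead of quantization scales, so real usage is higher. The 7B parameter count is an illustrative assumption.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) at the given bit width."""
    return n_params * bits / 8 / 1e9


# Illustrative 7-billion-parameter model (assumed size).
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B model, {label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions the weights take roughly 14 GB at FP16, 7 GB at INT8, and 3.5 GB at INT4, which is the difference between needing a data-center GPU and fitting on a consumer card.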
Practical workflow
- Start with prompting + RAG (cheapest iteration).
- If format or tone still drifts → LoRA fine-tune on curated examples.
- If latency or cost hurts → quantize self-hosted model or route to smaller model.
- Eval after every change (Module 8, Lesson 7).
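The workflow above reads naturally as a decision function. The sketch below encodes it under one added assumption: when latency or cost is the problem, self-hosted deployments quantize and hosted-API deployments route to a smaller model. The function and parameter names are hypothetical.

```python
def next_step(
    tried_prompting_and_rag: bool,
    format_or_tone_drifts: bool,
    latency_or_cost_hurts: bool,
    self_hosted: bool,
) -> str:
    """Suggest the next lever to try, in the order given by the workflow."""
    if not tried_prompting_and_rag:
        return "prompting + RAG"
    if format_or_tone_drifts:
        return "LoRA fine-tune on curated examples"
    if latency_or_cost_hurts:
        # Assumption: quantization is only an option when you control the weights.
        return "quantize self-hosted model" if self_hosted else "route to smaller model"
    return "no change needed"


print(next_step(True, True, False, self_hosted=False))
print(next_step(True, False, True, self_hosted=True))
```

Whatever the function returns, the last bullet still applies: rerun the eval set after the change before shipping it.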