Module 7 — GenAI & LLMs

Fine-tuning & quantization for work

LoRA and full fine-tuning for business requirements, INT8/GPTQ quantization for fast inference, and when each approach fits.

~80 min read + exercises

Before we begin

Business teams need models that are fast, affordable, and on-brand. Two levers dominate: fine-tuning (adapt behavior) and quantization (shrink weights for speed).

Fine-tuning changes what the model does. Quantization changes how efficiently it runs.


What you will learn

  • Choose full fine-tune vs LoRA for business requirements.
  • Explain INT8 / GPTQ / AWQ quantization trade-offs.
  • Know when to fine-tune vs use RAG (detailed comparison in Lesson 7).


Fine-tuning for business requirements

Use cases that fit fine-tuning:

Requirement | Example
Tone & format | Always reply in your company’s support voice
Structured output | Emit valid tool JSON every time
Domain vocabulary | Medical billing codes, legal clause templates
Classification head | Route tickets to departments

LoRA (Low-Rank Adaptation): train small adapter matrices instead of all weights.

  • Cheaper than a full fine-tune: often one GPU and hours rather than days.
  • Swappable: the base model stays frozen, so you can load a different adapter per customer.
  • Supported by Hugging Face PEFT and many cloud fine-tuning APIs.

Full fine-tune: update all weights. Use it when LoRA plateaus or the task needs deep weight changes (rare for most product teams).
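To make the cost gap between LoRA and a full fine-tune concrete, here is a minimal pure-Python sketch of the low-rank update y = Wx + (alpha/rank)·B(Ax) and the resulting trainable-parameter count. The layer size (4096 × 4096), the rank (8), and all function names are illustrative assumptions, not values from this lesson or the PEFT API.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two adapter matrices: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """y = W x + (alpha / rank) * B (A x). W is frozen; only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


# Assumed layer size and rank, chosen only for illustration.
d, r = 4096, 8
full = d * d
adapter = lora_trainable_params(d, d, r)
print(f"full: {full:,}  LoRA: {adapter:,}  ratio: {adapter / full:.4%}")
# prints: full: 16,777,216  LoRA: 65,536  ratio: 0.3906%

# Tiny forward pass: B starts at zero, so the adapted layer initially
# behaves exactly like the frozen base layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.25]]        # rank x d_in
B_zero = [[0.0], [0.0]]   # d_out x rank
x = [1.0, -1.0]
y_start = lora_forward(W, A, B_zero, x, alpha=2.0, rank=1)
```

Because W is never modified, swapping adapters per customer amounts to swapping the small A and B matrices while the base weights stay loaded.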


Data you need

Quality beats quantity:

  • 500–5,000 excellent (input, output) pairs often beat 50k noisy rows.
  • Include edge cases your eval set will test.
  • Hold out 20% for validation — never train on your eval queries.
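One way to enforce the hold-out rule above is to deduplicate by input before splitting, so the same query cannot land in both sets. This is a sketch under assumptions: the `split_pairs` helper, the case-insensitive dedup key, and the sample tickets are all hypothetical.

```python
import random


def split_pairs(pairs, holdout_fraction=0.2, seed=0):
    """Deduplicate (input, output) pairs by input, shuffle, and hold out a validation slice."""
    unique = {}
    for inp, out in pairs:
        # Assumed dedup rule: inputs that match after trimming and lowercasing
        # count as the same query; the first occurrence wins.
        unique.setdefault(inp.strip().lower(), (inp, out))
    rows = list(unique.values())
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_val = max(1, round(len(rows) * holdout_fraction))
    return rows[n_val:], rows[:n_val]  # (train, validation)


# Hypothetical data: 10 distinct tickets plus one near-duplicate input.
pairs = [(f"ticket {i}", f"reply {i}") for i in range(10)] + [("Ticket 3", "dup reply")]
train, val = split_pairs(pairs)
print(len(train), len(val))  # prints: 8 2
```

Real datasets usually need a stricter notion of "duplicate" (paraphrases, templated tickets), so treat the dedup key here as a starting point.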

Quantization

Weights are usually FP16/BF16 in training. Quantization stores them in fewer bits (INT8, INT4) for faster inference and less VRAM.

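Method | Idea | Trade-off
INT8 | 8-bit weights | Small quality loss; about half the memory of FP16, with speedups where the hardware and kernels support INT8
GPTQ / AWQ | 4-bit with calibration | Fits large models on consumer GPUs
Dynamic quant | Quantize at runtime | Easier to set up, usually less optimal than calibrated methods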
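The core idea behind the INT8 row can be sketched with symmetric absmax quantization: scale the weights so the largest magnitude maps to 127, round, and multiply back at inference time. This is plain round-to-nearest, not GPTQ or AWQ (which use calibration data to pick better roundings), and the weight values and function names are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers and the scale."""
    return [v * scale for v in q]


# Hypothetical weights for one small tensor.
weights = [0.12, -0.50, 0.33, 0.01, -0.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer step bounds the per-weight error by scale / 2.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {max_err:.5f} (bound {scale / 2:.5f})")
```

The "small quality loss" in the table is this rounding error accumulated across layers; a single outlier weight inflates `scale` and worsens the error for every other weight, which is one reason calibrated methods exist.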

When it matters: self-hosting Llama on one GPU, edge deployment, high QPS API where cost dominates.
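A rough back-of-the-envelope calculation shows why bit width decides whether a model fits on one GPU. The 7-billion-parameter size is an assumed example, and the estimate covers weights only: the KV cache, activations, and per-group quantization scales add to the real footprint.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_WEIGHT[dtype] / 1e9


# Assumed model size: 7 billion parameters.
for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
# prints:
# fp16: 14.0 GB
# int8: 7.0 GB
# int4: 3.5 GB
```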

When to skip: you use hosted APIs such as OpenAI or Anthropic. The provider controls how the model is served, so weight precision is not something you can tune.


Practical workflow

  1. Start with prompting + RAG (cheapest iteration).
  2. If format or tone still drifts → LoRA fine-tune on curated examples.
  3. If latency or cost hurts → quantize self-hosted model or route to smaller model.
  4. Eval after every change (Module 8, Lesson 7).
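The workflow above can be read as a small decision procedure. The sketch below is one possible encoding; the function name, the flags, and the rule that self-hosted models get quantized while hosted ones get routed to a smaller model are assumptions layered on the steps, not part of the lesson.

```python
def next_step(format_or_tone_drifts: bool, latency_or_cost_hurts: bool, self_hosted: bool) -> str:
    """Suggest the next lever to pull, assuming prompting + RAG has already been tried."""
    if format_or_tone_drifts:
        return "LoRA fine-tune on curated examples"
    if latency_or_cost_hurts:
        # Assumed split: quantization only applies when you control the weights.
        return "quantize the model" if self_hosted else "route to a smaller model"
    return "keep iterating on prompting + RAG"


print(next_step(format_or_tone_drifts=False, latency_or_cost_hurts=True, self_hosted=True))
# prints: quantize the model
```

Whatever the function returns, step 4 still applies: re-run the evaluation set after the change before moving on.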

What's next

Lesson 6 — Building AI-automated workflows