Metrics — confusion matrix, precision, recall & more
Before we begin
After splitting data, you need metrics that match what users care about. A single accuracy number is rarely enough — especially for spam, fraud, medical screening, and any imbalanced problem.
The tool that ties everything together is the confusion matrix: a small table of prediction outcomes. From those four cells you derive precision, recall, specificity, F1, and the thresholds you ship to production.
Confusion matrix — counts of TP, FP, FN, TN (what you got right and wrong).
Precision — of predicted positives, how many are truly positive?
Recall — of actual positives, how many did we catch?
F1 — one number balancing precision and recall.
Figure: Confusion matrix (binary)
What you will learn
- Build and read a confusion matrix from model predictions.
- Derive accuracy, precision, recall, specificity, and F1 from matrix counts.
- Tune a decision threshold and understand the precision–recall trade-off.
- Know when ROC/PR curves help (preview for imbalanced data).
- Extend ideas to multi-class problems and sklearn’s classification_report.
- Pick what to log in the spam project test evaluation.
The confusion matrix — your evaluation dashboard
For binary classification, pick one class as positive (spam). The other is negative (ham).
| | Predicted positive (spam) | Predicted negative (ham) |
|---|---|---|
| Actual positive (spam) | True positive (TP) | False negative (FN) |
| Actual negative (ham) | False positive (FP) | True negative (TN) |
How to build it from predictions
- Run the model on a labeled set (usually test — only once at the end).
- Compare y_true vs y_pred row by row.
- Increment TP, FP, FN, TN.
```python
# y_true, y_pred: lists of 0/1 where 1 = spam
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
```
Reading the matrix
| Cell | Meaning | User pain (spam example) |
|---|---|---|
| TP | Correctly caught spam | Inbox stays clean |
| FP | Ham called spam | Important email hidden in spam folder |
| FN | Spam called ham | Spam in inbox |
| TN | Correctly kept ham | Normal mail untouched |
Rows sum to actual class counts. Columns sum to predicted counts. If your row totals do not match your dataset label counts, something is wrong with the split or encoding.
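The row and column checks can be automated. The sketch below is a minimal illustration, assuming labels are encoded as 0/1 with 1 = spam; the toy `y_true` and `y_pred` lists are made up for the example.

```python
from collections import Counter

# Toy labels for illustration only (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

label_counts = Counter(y_true)  # how many actual spam / ham
pred_counts = Counter(y_pred)   # how many predicted spam / ham

# Rows: actual spam = TP + FN, actual ham = FP + TN.
assert tp + fn == label_counts[1]
assert fp + tn == label_counts[0]
# Columns: predicted spam = TP + FP, predicted ham = FN + TN.
assert tp + fp == pred_counts[1]
assert fn + tn == pred_counts[0]

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

If one of the assertions fails on real data, the usual suspects are a label encoding mismatch (for example strings vs integers) or predictions aligned to the wrong split.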
Raw counts vs normalized matrix
| View | When to use |
|---|---|
| Counts (TP=6, FP=2, …) | Debugging, small datasets, exact error counts |
| Row-normalized (each row sums to 100%) | “Of all real spam, what % did we catch?” → recall per row |
| Column-normalized | “Of all spam predictions, what % were correct?” → precision per column |
For reports, show counts + derived metrics. Normalized heatmaps are great in dashboards.
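As a rough sketch of the two normalized views, the snippet below divides a count matrix by its row totals and by its column totals. It assumes the layout used in this lesson (rows = actual, columns = predicted, spam first) and that no row or column is empty; the helper names `row_normalize` and `column_normalize` are hypothetical.

```python
# Counts laid out as in this lesson: rows = actual, columns = predicted.
matrix = [
    [6, 4],   # actual spam: TP, FN
    [2, 88],  # actual ham:  FP, TN
]

def row_normalize(m):
    """Divide each cell by its row total, so each row sums to 1."""
    return [[cell / sum(row) for cell in row] for row in m]

def column_normalize(m):
    """Divide each cell by its column total, so each column sums to 1."""
    col_totals = [sum(col) for col in zip(*m)]
    return [[cell / total for cell, total in zip(row, col_totals)] for row in m]

by_row = row_normalize(matrix)
by_col = column_normalize(matrix)

# The spam/spam cell is recall in the row view and precision in the column view.
print(f"recall for spam:    {by_row[0][0]:.2f}")
print(f"precision for spam: {by_col[0][0]:.2f}")
```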
Metrics derived from the matrix
All binary metrics below assume spam = positive class.
Accuracy
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Easy to interpret when classes are balanced and all errors cost the same.
When it lies: 99% of emails are ham. A model predicting always ham gets 99% accuracy but TP = 0 — zero spam caught.
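The always-ham failure is easy to reproduce. This is a small sketch on a made-up inbox of 1,000 emails with 99% ham; the numbers are illustrative, not from a real dataset.

```python
# Hypothetical inbox: 990 ham (0) and 10 spam (1), i.e. 99% ham.
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts ham.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)  # share of real spam that was caught

print(f"accuracy={accuracy:.2%} recall={recall:.2%} TP={tp}")
```

Accuracy looks excellent while recall is zero, which is why the matrix counts belong next to any accuracy figure.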
Precision (positive predictive value)
Precision = TP / (TP + FP)
Question: Of everything we flagged as spam, how much was actually spam?
High precision → fewer false alarms → fewer good emails lost.
Product goal: “Do not hide my important mail.”
Recall (sensitivity, true positive rate)
Recall = TP / (TP + FN)
Question: Of all real spam, how much did we catch?
High recall → cleaner inbox.
Product goal: “Keep spam out of my inbox.”
Specificity (true negative rate)
Specificity = TN / (TN + FP)
Question: Of all real ham, how much did we correctly leave alone?
Useful when protecting the majority class matters (ham must not be disturbed). Precision and specificity both punish FP, but from different denominators.
False positive rate
FPR = FP / (FP + TN) = 1 - specificity
Common in ROC curves, which plot TPR against FPR at many thresholds.
Worked numeric example
100 emails: 10 spam, 90 ham. The model flags 8 as spam (6 truly spam, 2 ham flagged by mistake) and misses the other 4 spam (predicted ham).
| Count | Value |
|---|---|
| TP | 6 |
| FP | 2 |
| FN | 4 |
| TN | 88 |
| Metric | Formula | Result |
|---|---|---|
| Accuracy | (6+88)/100 | 94% |
| Precision | 6/(6+2) | 75% — 1 in 4 spam flags wrong |
| Recall | 6/(6+4) | 60% — 40% of spam slipped through |
| Specificity | 88/(88+2) | 97.8% — ham mostly safe |
| F1 | see below | 66.7% |
Insight: 94% accuracy still allows painful spam and false flags. Always report the matrix (or TP/FP/FN/TN) plus precision and recall.
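To double-check the table, the metrics can be recomputed from the four counts. This is a minimal sketch: `binary_metrics` is a hypothetical helper name, and it assumes every denominator is non-zero, which holds for this example but not for a model that never predicts spam.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive the standard binary metrics from the four matrix counts.

    Assumes every denominator is non-zero (true for the example below).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts from the worked example: 10 spam, 90 ham.
metrics = binary_metrics(tp=6, fp=2, fn=4, tn=88)
for name, value in metrics.items():
    print(f"{name:<12}{value:.1%}")
```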
F1 score — one number for balance
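When you need a single score and both false positives and false negatives matter, use the harmonic mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)
For the example above: F1 = 2 × (0.75 × 0.60) / (0.75 + 0.60) ≈ 0.667.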
| F1 | Interpretation |
|---|---|
| High | Both precision and recall are decent |
| Low while accuracy looks high | Class imbalance or many easy negatives masking failure on positives |
F-beta generalizes F1: F₂ weights recall higher (catch spam); F₀.₅ weights precision higher (protect ham). Mentioned for product conversations — F1 is the default balance.
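For reference, the standard definition is F-beta = (1 + β²) × precision × recall / (β² × precision + recall). The sketch below applies it to the worked example's precision and recall; `f_beta` is a hypothetical helper, and it assumes precision and recall are not both zero.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.75, 0.60  # from the worked example

f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)      # leans toward recall
f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision

print(f"F1={f1:.3f} F2={f2:.3f} F0.5={f_half:.3f}")
```

Because recall (0.60) is the weaker of the two numbers here, the recall-heavy F2 comes out below F1 and the precision-heavy F0.5 comes out above it.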
Decision threshold — metrics are not fixed
Most classifiers output a probability or score, not a hard label. You choose a threshold:
```python
# scores: P(spam) from logistic regression
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 is default, not sacred
```
Figure: Precision vs recall as threshold moves
| Threshold | Typical effect |
|---|---|
| Lower (e.g. 0.3) | Predict more spam → recall ↑, precision ↓ |
| Higher (e.g. 0.8) | Predict less spam → precision ↑, recall ↓ |
Tune on validation, not test. In the spam project, try a few thresholds on the validation set, pick the one that matches product goals, then report final numbers on test once.
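A threshold sweep on validation data might look like the sketch below. The labels and scores are invented for illustration, `precision_recall_at` is a hypothetical helper, and returning 0.0 when nothing is flagged is one common convention rather than the only option.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when score >= threshold is labeled spam."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # nothing flagged
    recall = tp / (tp + fn) if tp + fn else 0.0     # no positives present
    return precision, recall

# Made-up validation labels and P(spam) scores, for illustration only.
y_val = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
val_scores = [0.95, 0.85, 0.55, 0.35, 0.60, 0.40, 0.20, 0.15, 0.10, 0.05]

results = {}
for threshold in (0.3, 0.5, 0.8):
    results[threshold] = precision_recall_at(threshold, y_val, val_scores)
    p, r = results[threshold]
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold trades recall for precision at every step. Real curves are usually noisier, so inspect several thresholds rather than assuming a smooth trade-off.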
ROC curve & AUC (preview)
For many thresholds, plot TPR (recall) vs FPR:
- ROC curve — how well the model ranks positives above negatives across thresholds.
- AUC (area under curve) — 1.0 = perfect ranking, 0.5 = random.
When ROC helps: balanced-ish problems, comparing models independent of one threshold.
Caveat on imbalanced spam data: ROC can look optimistic when negatives dominate. Also check a precision–recall (PR) curve, which focuses on the positive class.
You do not need to implement ROC in Module 2 — but know it exists when stakeholders ask “what’s the AUC?”
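For intuition only, AUC can be read as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The optional sketch below computes it that way on made-up scores; `roc_auc` is a hypothetical helper, not a library function, and it assumes both classes are present.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Runs in O(n_pos * n_neg), which is fine
    for small examples but slow on large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up labels and P(spam) scores, for illustration only.
y_val = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
val_scores = [0.95, 0.85, 0.55, 0.35, 0.60, 0.40, 0.20, 0.15, 0.10, 0.05]

auc = roc_auc(y_val, val_scores)
print(f"AUC={auc:.3f}")
```

No threshold appears anywhere in the computation, which is what makes AUC convenient for comparing models before a threshold is chosen.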
Multi-class confusion matrix
With 3+ classes (e.g. cat, dog, bird), the matrix is K × K:
- Rows = actual class
- Columns = predicted class
- Diagonal = correct predictions
- Off-diagonal = confusions (cat called dog, etc.)
Per-class precision/recall/F1 are computed by treating one class as positive and all others as negative (one-vs-rest).
Macro vs weighted averages
| Average | How | When |
|---|---|---|
| Macro | Mean of per-class F1 (equal weight per class) | Rare classes matter as much as common ones |
| Weighted | Mean weighted by class support | Reflects overall performance on the full dataset |
| Micro | Pool all TP/FP/FN globally | Equals accuracy when every sample has exactly one label |
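The three averages can be compared on a small example. The sketch below uses an invented 3-class matrix with a rare "bird" class; `one_vs_rest_counts` is a hypothetical helper, and F1 is computed with the equivalent form 2·TP / (2·TP + FP + FN).

```python
# Hypothetical 3-class matrix: rows = actual, columns = predicted.
labels = ["cat", "dog", "bird"]
matrix = [
    [8, 2, 0],  # actual cat
    [1, 6, 1],  # actual dog
    [0, 1, 1],  # actual bird (rare class)
]

def one_vs_rest_counts(m, k):
    """TP, FP, FN for class index k, treating all other classes as negative."""
    tp = m[k][k]
    fp = sum(row[k] for row in m) - tp  # rest of column k
    fn = sum(m[k]) - tp                 # rest of row k
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

counts = [one_vs_rest_counts(matrix, k) for k in range(len(labels))]
f1s = [f1_from_counts(*c) for c in counts]
support = [sum(row) for row in matrix]
total = sum(support)

macro_f1 = sum(f1s) / len(f1s)                                  # equal weight per class
weighted_f1 = sum(f * s for f, s in zip(f1s, support)) / total  # weight by support
micro_f1 = f1_from_counts(*(sum(col) for col in zip(*counts)))  # pooled counts
accuracy = sum(matrix[k][k] for k in range(len(labels))) / total

for name, f1 in zip(labels, f1s):
    print(f"{name:<5} F1={f1:.3f}")
print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f} micro={micro_f1:.3f}")
```

The weak score on the rare class pulls the macro average below the weighted one, and the micro average matches plain accuracy because each sample has exactly one label.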
sklearn classification_report (spam project)
After training, this is the fastest sanity check on test:
```python
from sklearn.metrics import classification_report, confusion_matrix

# Raw counts. With labels 0 = ham, 1 = spam the layout is [[TN, FP], [FN, TP]].
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
```
Example output shape (numbers consistent with the worked example: TP=6, FP=2, FN=4, TN=88):
```
              precision    recall  f1-score   support

         ham       0.96      0.98      0.97        90
        spam       0.75      0.60      0.67        10

    accuracy                           0.94       100
   macro avg       0.85      0.79      0.82       100
weighted avg       0.94      0.94      0.94       100
```
| Column | Meaning |
|---|---|
| precision | Per class — of predicted X, how many were X |
| recall | Per class — of actual X, how many caught |
| f1-score | Harmonic mean per class |
| support | Count of true labels — always read this |
What to report in the spam project
Minimum test-set deliverables:
- Confusion matrix (counts or labeled heatmap)
- Precision, recall, F1 for spam class
- Support (how many spam vs ham in test)
- One sentence on threshold choice from validation
Optional stretch: plot a PR curve, or try class_weight='balanced' in logistic regression and compare recall.
Which metric when? (cheat sheet)
| Situation | Lead metric | Also watch |
|---|---|---|
| Balanced classes, equal error cost | Accuracy | Confusion matrix |
| Imbalanced (spam, fraud) | Precision and recall (or F1) | Support, matrix |
| Missing positives is costly (fraud, disease) | Recall | FPR, precision |
| False alarms are costly (ham → spam) | Precision | Specificity |
| Compare models before picking threshold | ROC-AUC or PR-AUC | Curves at multiple thresholds |
| Multi-class | Per-class F1 + macro or weighted | Full matrix |
When is accuracy a bad sole metric?
- Imbalanced classes — majority-class classifier wins accuracy while failing the minority.
- Unequal error costs — missing fraud hurts more than a false alert.
- Safety-critical domains — false negatives can be catastrophic.
Use the confusion matrix first, then precision/recall/F1 (and curves later in your career).
Checkpoint
- A model has precision 99% and recall 10% for spam. Describe the user experience.
- FP increases. Which metrics move down — precision, recall, both, or neither?
- Why report support alongside F1?
- You tune threshold on the test set to maximize F1. What rule did you break?
Answers
- Almost every spam flag is correct (rare false positives), but most spam still arrives in the inbox — low recall.
- Precision drops (more predicted positives are wrong). Recall is unchanged if TP and FN stay the same.
- F1 on 10 spam examples vs 10,000 spam examples is not comparable — support shows class counts.
- Test set is for final reporting only — threshold tuning belongs on validation to avoid optimistic bias.
What's next
Module 2 quiz — test your understanding, then the spam classifier project where you will plot the confusion matrix for real.