Module 2 — Core machine learning

Metrics — confusion matrix, precision, recall & F1

Build and read confusion matrices, derive precision/recall/specificity/F1, tune thresholds, use sklearn's classification_report, and preview ROC/PR curves for imbalanced spam data.

~95 min read + exercises

Before we begin

After splitting data, you need metrics that match what users care about. A single accuracy number is rarely enough — especially for spam, fraud, medical screening, and any imbalanced problem.

The tool that ties everything together is the confusion matrix: a small table of prediction outcomes. From its four cells you derive precision, recall, specificity, and F1, and those numbers guide the threshold you ship to production.

Confusion matrix — counts of TP, FP, FN, TN (what you got right and wrong).
Precision — of predicted positives, how many are truly positive?
Recall — of actual positives, how many did we catch?
F1 — one number balancing precision and recall.

Figure: Confusion matrix (binary), spam = positive class

Cells: TP (spam→spam), FP (ham→spam), FN (spam→ham), TN (ham→ham).
Rows = actual class, columns = predicted class. Spam is the positive class here.

What you will learn

  • Build and read a confusion matrix from model predictions.
  • Derive accuracy, precision, recall, specificity, and F1 from matrix counts.
  • Tune a decision threshold and understand the precision–recall trade-off.
  • Know when ROC/PR curves help (preview for imbalanced data).
  • Extend ideas to multi-class problems and sklearn’s classification_report.
  • Pick what to log in the spam project test evaluation.


The confusion matrix — your evaluation dashboard

For binary classification, pick one class as positive (spam). The other is negative (ham).

| | Predicted positive (spam) | Predicted negative (ham) |
|---|---|---|
| Actual positive (spam) | True positive (TP) | False negative (FN) |
| Actual negative (ham) | False positive (FP) | True negative (TN) |

How to build it from predictions

  1. Run the model on a labeled set (usually test — only once at the end).
  2. Compare y_true vs y_pred row by row.
  3. Increment TP, FP, FN, TN.
python
# y_true, y_pred: lists of 0/1 where 1 = spam
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

Reading the matrix

| Cell | Meaning | User pain (spam example) |
|---|---|---|
| TP | Correctly caught spam | Inbox stays clean |
| FP | Ham called spam | Important email hidden in spam folder |
| FN | Spam called ham | Spam in inbox |
| TN | Correctly kept ham | Normal mail untouched |

Rows sum to actual class counts. Columns sum to predicted counts. If your row totals do not match your dataset label counts, something is wrong with the split or encoding.
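The row-total check can be sketched in a few lines. This is a minimal illustration, assuming 0/1 labels with 1 = spam; `check_matrix` is a hypothetical helper name and the toy labels are made up.

```python
from collections import Counter

def check_matrix(y_true, y_pred, expected_spam, expected_ham):
    """Count the four cells and verify row totals against known label counts."""
    cells = Counter(zip(y_true, y_pred))
    tp, fn = cells[(1, 1)], cells[(1, 0)]
    fp, tn = cells[(0, 1)], cells[(0, 0)]
    # Any pair outside {0, 1} x {0, 1} means the labels are encoded differently
    if tp + fn + fp + tn != len(y_true):
        raise ValueError("some labels are not 0/1; check the encoding")
    if tp + fn != expected_spam or fp + tn != expected_ham:
        raise ValueError("row totals do not match dataset label counts")
    return {"actual_spam": tp + fn, "actual_ham": fp + tn,
            "predicted_spam": tp + fp, "predicted_ham": fn + tn}

# Hypothetical toy labels: 3 spam, 5 ham
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
totals = check_matrix(y_true, y_pred, expected_spam=3, expected_ham=5)
print(totals)
```

Passing the expected counts in from the dataset (rather than recomputing them from `y_true`) is what lets the check catch a bad split or a label-encoding mix-up.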

Raw counts vs normalized matrix

| View | When to use |
|---|---|
| Counts (TP=6, FP=2, …) | Debugging, small datasets, exact error counts |
| Row-normalized (each row sums to 100%) | “Of all real spam, what % did we catch?” → recall per row |
| Column-normalized | “Of all spam predictions, what % were correct?” → precision per column |

For reports, show counts + derived metrics. Normalized heatmaps are great in dashboards.
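Both normalizations are a single division per cell. The sketch below uses plain nested lists and borrows the counts from the worked example later in this lesson; the function names are illustrative, not a library API.

```python
def normalize_rows(matrix):
    """Divide each row by its sum: diagonal entries become per-class recall."""
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in matrix]

def normalize_columns(matrix):
    """Divide each column by its sum: diagonal entries become per-class precision."""
    col_sums = [sum(col) for col in zip(*matrix)]
    return [[c / s if s else 0.0 for c, s in zip(row, col_sums)] for row in matrix]

# Rows = actual (spam, ham), columns = predicted (spam, ham)
counts = [[6, 4],    # actual spam: TP=6, FN=4
          [2, 88]]   # actual ham:  FP=2, TN=88

by_row = normalize_rows(counts)
by_col = normalize_columns(counts)
print(f"spam recall    = {by_row[0][0]:.2f}")  # 6 / (6 + 4)
print(f"spam precision = {by_col[0][0]:.2f}")  # 6 / (6 + 2)
```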


Metrics derived from the matrix

All binary metrics below assume spam = positive class.

Accuracy

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Easy to interpret when classes are balanced and all errors cost the same.

When it lies: suppose 99% of emails are ham. A model that always predicts ham reaches 99% accuracy, yet TP = 0: it catches no spam at all.
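A quick way to see this is to score the always-ham baseline directly. The 990/10 split below is a made-up inbox chosen to match the 99% figure.

```python
# Hypothetical inbox: 990 ham (0) and 10 spam (1)
y_true = [0] * 990 + [1] * 10
y_pred = [0] * len(y_true)  # baseline that always predicts ham

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, spam caught={tp}, recall={recall:.2f}")
# accuracy=0.99, spam caught=0, recall=0.00
```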

Precision (positive predictive value)

$$\text{precision} = \frac{TP}{TP + FP}$$

Question: Of everything we flagged as spam, how much was actually spam?

High precision → fewer false alarms → fewer good emails lost.
Product goal: “Do not hide my important mail.”

Recall (sensitivity, true positive rate)

$$\text{recall} = \frac{TP}{TP + FN}$$

Question: Of all real spam, how much did we catch?

High recall → cleaner inbox.
Product goal: “Keep spam out of my inbox.”

Specificity (true negative rate)

$$\text{specificity} = \frac{TN}{TN + FP}$$

Question: Of all real ham, how much did we correctly leave alone?

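Useful when protecting the majority class matters (ham must not be disturbed). Precision and specificity both punish FP, but with different denominators: precision divides by everything predicted as spam (TP + FP), while specificity divides by everything that is actually ham (TN + FP).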

False positive rate

$$\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{specificity}$$

FPR is the x-axis of a ROC curve, which plots TPR against FPR at many thresholds.
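To see how the same false positives hit precision and specificity differently, here is a small sketch. It assumes a hypothetical set of 10 spam and 90 ham emails and holds TP and FN fixed while FP grows.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    return tn / (tn + fp) if (tn + fp) else 0.0

# Hypothetical imbalanced set: 10 spam, 90 ham; TP and FN stay fixed
tp, fn, n_ham = 6, 4, 90
rows = []
for fp in (2, 10):
    tn = n_ham - fp
    rows.append((fp, precision(tp, fp), recall(tp, fn), specificity(tn, fp)))
    print(f"FP={fp:2d} precision={rows[-1][1]:.3f} recall={rows[-1][2]:.3f} "
          f"specificity={rows[-1][3]:.3f} FPR={1 - rows[-1][3]:.3f}")
```

Going from 2 to 10 false positives halves precision (0.750 to 0.375) while specificity only slips from 0.978 to 0.889, because the ham denominator is much larger. Recall does not move at all.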


Worked numeric example

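100 emails: 10 spam, 90 ham. The model flags 8 emails as spam: 6 are truly spam and 2 are ham flagged by mistake. It misses the other 4 spam emails (predicted as ham), and the remaining 88 ham emails are correctly left alone.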

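| Count | Value |
|---|---|
| TP | 6 |
| FP | 2 |
| FN | 4 |
| TN | 88 |

| Metric | Formula | Result |
|---|---|---|
| Accuracy | (6+88)/100 | 94% |
| Precision | 6/(6+2) | 75% (1 in 4 spam flags wrong) |
| Recall | 6/(6+4) | 60% (40% of spam slipped through) |
| Specificity | 88/(88+2) | 97.8% (ham mostly safe) |
| F1 | see below | 66.7% |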

Insight: 94% accuracy still allows painful spam and false flags. Always report the matrix (or TP/FP/FN/TN) plus precision and recall.
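The numbers above can be reproduced from the four counts alone. This is a minimal sketch; `binary_metrics` is a hypothetical helper, and returning 0.0 for an undefined ratio is an assumption made here for simplicity.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the lesson's metrics from the four confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumption: report 0.0 when undefined

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

m = binary_metrics(tp=6, fp=2, fn=4, tn=88)
for name, value in m.items():
    print(f"{name:12s}{value:.3f}")
```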


F1 score — one number for balance

When you need a single score and both false positives and false negatives matter:

$$\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

For the example above: F1 = 2 × (0.75 × 0.60) / (0.75 + 0.60) ≈ 0.667.
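Substituting the count-based definitions of precision and recall gives F1 directly from the matrix cells. It is a standard algebraic identity, shown here as a worked step, and it makes visible that TN never enters the score:

```latex
\begin{aligned}
\text{F1} &= 2 \cdot \frac{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}
                         {\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}
           = \frac{2\,TP}{2\,TP + FP + FN} \\
          &= \frac{2 \cdot 6}{2 \cdot 6 + 2 + 4} = \frac{12}{18} \approx 0.667
\end{aligned}
```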

| F1 | Interpretation |
|---|---|
| High | Both precision and recall are decent |
| Low while accuracy looks high | Class imbalance or many easy negatives masking failure on positives |

F-beta generalizes F1: β > 1 weights recall more heavily and β < 1 weights precision more heavily. F₂ favors recall (catch spam); F₀.₅ favors precision (protect ham). It is worth knowing for product conversations, but F1 is the default balance.
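For reference, the usual definition is F_β = (1 + β²) · precision · recall / (β² · precision + recall). A minimal sketch, applied to the precision and recall from the worked example (`f_beta` is an illustrative helper name):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

p, r = 0.75, 0.60  # precision and recall from the worked example
print(f"F0.5={f_beta(p, r, 0.5):.3f}  F1={f_beta(p, r):.3f}  F2={f_beta(p, r, 2):.3f}")
# F0.5=0.714  F1=0.667  F2=0.625
```

Because recall (0.60) is the weaker of the two here, the recall-heavy F₂ comes out lowest and the precision-heavy F₀.₅ highest.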


Decision threshold — metrics are not fixed

Most classifiers output a probability or score, not a hard label. You choose a threshold:

python
# scores: P(spam) from logistic regression
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 is default, not sacred

Figure: Precision vs recall as threshold moves (x-axis: threshold on the spam score, from loose to strict; y-axis: metric value; curves: recall and precision)

Lower threshold → flag more as spam. Loosen threshold → recall rises, precision often falls (and vice versa).

| Threshold | Typical effect |
|---|---|
| Lower (e.g. 0.3) | Predict more spam → recall ↑, precision ↓ |
| Higher (e.g. 0.8) | Predict less spam → precision ↑, recall ↓ |

Tune on validation, not test. In the spam project, try a few thresholds on the validation set, pick the one that matches product goals, then report final numbers on test once.
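A threshold sweep on validation data might look like the sketch below. The scores, labels, and the 0.75 precision floor are all made up for illustration, and treating precision as 1.0 when nothing is flagged is an assumption, not a universal convention.

```python
def precision_recall_at(scores, y_true, threshold):
    """Precision and recall when score >= threshold is predicted as spam."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumption: no flags, no false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical validation scores, P(spam), with their true labels (1 = spam)
val_scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
val_labels = [1,    1,    0,    1,    1,    0,    0,    0,    0,    0]

min_precision = 0.75  # hypothetical product requirement
chosen = None
for threshold in (0.3, 0.5, 0.8):  # ascending, so the first hit has the highest recall
    p, r = precision_recall_at(val_scores, val_labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
    if chosen is None and p >= min_precision:
        chosen = threshold
print("chosen threshold:", chosen)
```

On this toy data, 0.3 gives full recall but precision 0.57, 0.8 gives perfect precision but recall 0.50, and 0.5 is the loosest threshold that still meets the precision floor.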


ROC curve & AUC (preview)

For many thresholds, plot TPR (recall) vs FPR:

  • ROC curve — how well the model ranks positives above negatives across thresholds.
  • AUC (area under curve) — 1.0 = perfect ranking, 0.5 = random.
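One way to read AUC: it equals the probability that a randomly chosen spam email gets a higher score than a randomly chosen ham email. The sketch below computes it by brute force over all pairs. It is meant only to illustrate the ranking interpretation on tiny data, and the scores are made up.

```python
def roc_auc_pairwise(scores, y_true):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. O(n_pos * n_neg), fine for small illustrations only.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: two spam emails (0.60, 0.45) rank below one ham (0.70)
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    1,    0,    0,    0]
auc = roc_auc_pairwise(scores, labels)
print(auc)  # 14 of 16 pairs ranked correctly -> 0.875
```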

When ROC helps: balanced-ish problems, comparing models independent of one threshold.

Caveat on imbalanced spam data: ROC can look optimistic when negatives dominate, because FPR divides by the large ham count, so even many false positives barely move it. Also check a precision–recall (PR) curve, which focuses on the positive class.
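A single operating point shows the effect. The counts below are hypothetical: 1,000 emails with only 10 spam.

```python
# Hypothetical operating point on 1,000 emails: 10 spam, 990 ham
tp, fn = 8, 2
fp, tn = 40, 950

tpr = tp / (tp + fn)        # recall, the ROC y-axis
fpr = fp / (fp + tn)        # the ROC x-axis
precision = tp / (tp + fp)  # what a PR curve tracks instead of FPR

print(f"TPR={tpr:.2f} FPR={fpr:.3f} precision={precision:.3f}")
# TPR=0.80 FPR=0.040 precision=0.167
```

On a ROC plot this point sits near the top-left corner and looks strong, yet five out of six spam flags are wrong. The PR view exposes that immediately.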

You do not need to implement ROC in Module 2 — but know it exists when stakeholders ask “what’s the AUC?”


Multi-class confusion matrix

With 3+ classes (e.g. cat, dog, bird), the matrix is K × K:

  • Rows = actual class
  • Columns = predicted class
  • Diagonal = correct predictions
  • Off-diagonal = confusions (cat called dog, etc.)

Per-class precision/recall/F1 are computed by treating one class as positive and all others as negative (one-vs-rest).
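The one-vs-rest bookkeeping reads straight off the matrix: for class k, TP is the diagonal cell, FN is the rest of row k, and FP is the rest of column k. A minimal sketch with a made-up 3 × 3 matrix (`per_class_metrics` is an illustrative name):

```python
def per_class_metrics(matrix, labels):
    """One-vs-rest precision/recall/F1 from a K x K matrix (rows = actual)."""
    out = {}
    for k, name in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # rest of row k
        fp = sum(row[k] for row in matrix) - tp  # rest of column k
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        out[name] = {"precision": p, "recall": r, "f1": f1, "support": tp + fn}
    return out

labels = ["cat", "dog", "bird"]
matrix = [[8, 2, 0],   # actual cat
          [1, 7, 2],   # actual dog
          [0, 1, 4]]   # actual bird

per_class = per_class_metrics(matrix, labels)
for name, m in per_class.items():
    print(f"{name:5s} precision={m['precision']:.3f} recall={m['recall']:.3f} "
          f"f1={m['f1']:.3f} support={m['support']}")
```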

Macro vs weighted averages

| Average | How | When |
|---|---|---|
| Macro | Mean of per-class F1 (equal weight per class) | Rare classes matter as much as common ones |
| Weighted | Mean weighted by class support | Reflects overall performance on the full dataset |
| Micro | Pool all TP/FP/FN globally | Equals accuracy when every example has exactly one label |
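The three averages can be compared side by side on a small example. The matrix below is hypothetical (cat, dog, bird with supports 10, 10, 5), and the F1 helper uses the count form 2·TP / (2·TP + FP + FN), which is algebraically the same as 2PR / (P + R).

```python
def f1_from_counts(tp, fp, fn):
    """F1 in count form; equivalent to 2PR / (P + R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical 3-class matrix: rows = actual, columns = predicted
matrix = [[8, 2, 0],
          [1, 7, 2],
          [0, 1, 4]]
k = len(matrix)
tps = [matrix[i][i] for i in range(k)]
fns = [sum(matrix[i]) - tps[i] for i in range(k)]
fps = [sum(row[i] for row in matrix) - tps[i] for i in range(k)]
support = [tp + fn for tp, fn in zip(tps, fns)]
per_class_f1 = [f1_from_counts(tp, fp, fn) for tp, fp, fn in zip(tps, fps, fns)]

macro_f1 = sum(per_class_f1) / k
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, support)) / sum(support)
micro_f1 = f1_from_counts(sum(tps), sum(fps), sum(fns))
accuracy = sum(tps) / sum(support)

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f} "
      f"micro={micro_f1:.3f} accuracy={accuracy:.3f}")
```

Micro-F1 and accuracy coincide here (both 0.760) because every misclassified example counts once as an FP for one class and once as an FN for another.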

sklearn classification_report (spam project)

After training, this is the fastest sanity check on test:

python
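from sklearn.metrics import classification_report, confusion_matrix

# y_test, y_pred: 0/1 labels where 1 = spam
# Raw counts; rows = actual, columns = predicted, labels in sorted order,
# so with 0 = ham and 1 = spam this prints [[TN, FP], [FN, TP]].
print(confusion_matrix(y_test, y_pred))
# target_names follow the same sorted label order: 0 -> "ham", 1 -> "spam"
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))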

Example output shape (numbers match the worked example: TP=6, FP=2, FN=4, TN=88):

text
              precision    recall  f1-score   support

         ham       0.96      0.98      0.97        90
        spam       0.75      0.60      0.67        10

    accuracy                           0.94       100
   macro avg       0.85      0.79      0.82       100
weighted avg       0.94      0.94      0.94       100

| Column | Meaning |
|---|---|
| precision | Per class: of predicted X, how many were X |
| recall | Per class: of actual X, how many were caught |
| f1-score | Harmonic mean of precision and recall, per class |
| support | Count of true labels per class (always read this) |
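One detail worth checking when you read the printed matrix: sklearn's confusion_matrix orders rows and columns by sorted label, so with 0 = ham and 1 = spam the top-left cell is TN, not TP as in the tables earlier in this lesson. The pure-Python sketch below mirrors that layout. It is an illustration, not sklearn's implementation, and the label lists are rebuilt from the worked example.

```python
def confusion_matrix_sorted(y_true, y_pred):
    """K x K counts with rows = actual, columns = predicted, labels sorted."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix

# Worked example rebuilt as label lists: 0 = ham, 1 = spam
y_test = [1] * 6 + [1] * 4 + [0] * 2 + [0] * 88
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 88

labels, matrix = confusion_matrix_sorted(y_test, y_pred)
(tn, fp), (fn, tp) = matrix  # with 0/1 labels, the negative class comes first
print(matrix)                # [[88, 2], [4, 6]]
```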

What to report in the spam project

Minimum test-set deliverables:

  • Confusion matrix (counts or labeled heatmap)
  • Precision, recall, F1 for spam class
  • Support (how many spam vs ham in test)
  • One sentence on threshold choice from validation
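One possible way to bundle these deliverables into a single loggable record is sketched below. The function name, field names, and threshold note are all hypothetical; adapt them to whatever your project logs.

```python
import json

def spam_test_report(tp, fp, fn, tn, threshold, threshold_note):
    """Bundle the minimum deliverables into one JSON-serializable dict."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {
        "confusion_matrix": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "spam": {"precision": round(precision, 3),
                 "recall": round(recall, 3),
                 "f1": round(f1, 3)},
        "support": {"spam": tp + fn, "ham": fp + tn},
        "threshold": threshold,
        "threshold_note": threshold_note,
    }

# Counts from the worked example; the note is illustrative wording
report = spam_test_report(
    6, 2, 4, 88, threshold=0.5,
    threshold_note="chosen on validation to keep precision >= 0.75",
)
print(json.dumps(report, indent=2))
```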

Optional stretch: PR curve plot, try class_weight='balanced' in logistic regression and compare recall.


Which metric when? (cheat sheet)

| Situation | Lead metric | Also watch |
|---|---|---|
| Balanced classes, equal error cost | Accuracy | Confusion matrix |
| Imbalanced (spam, fraud) | Precision and recall (or F1) | Support, matrix |
| Missing positives is costly (fraud, disease) | Recall | FPR, precision |
| False alarms are costly (ham → spam) | Precision | Specificity |
| Compare models before picking threshold | ROC-AUC or PR-AUC | Curves at multiple thresholds |
| Multi-class | Per-class F1 + macro or weighted | Full matrix |

When is accuracy a bad sole metric?

  • Imbalanced classes — majority-class classifier wins accuracy while failing the minority.
  • Unequal error costs — missing fraud hurts more than a false alert.
  • Safety-critical domains — false negatives can be catastrophic.

Use the confusion matrix first, then precision/recall/F1 (and curves later in your career).


Checkpoint

  1. A model has precision 99% and recall 10% for spam. Describe the user experience.
  2. FP increases. Which metrics move down — precision, recall, both, or neither?
  3. Why report support alongside F1?
  4. You tune threshold on the test set to maximize F1. What rule did you break?
Answers
  1. Almost every spam flag is correct (rare false positives), but most spam still arrives in the inbox — low recall.
  2. Precision drops (more predicted positives are wrong). Recall is unchanged if TP and FN stay the same.
  3. An F1 computed on 10 spam examples is far noisier than one computed on 10,000, so the two are not comparable at face value. Support shows how many examples each score rests on.
  4. Test set is for final reporting only — threshold tuning belongs on validation to avoid optimistic bias.

What's next

Module 2 quiz — test your understanding, then the spam classifier project where you will plot the confusion matrix for real.