Metrics — confusion matrix, precision, recall & more

Before we begin

After splitting data, you need metrics that match what users care about. A single accuracy number is rarely enough — especially for spam, fraud, medical screening, and any imbalanced problem.

The tool that ties everything together is the confusion matrix: a small table of prediction outcomes. From those four cells you derive precision, recall, specificity, F1, and the thresholds you ship to production.

Confusion matrix — counts of TP, FP, FN, TN (what you got right and wrong).
Precision — of predicted positives, how many are truly positive?
Recall — of actual positives, how many did we catch?
F1 — one number balancing precision and recall.

Figure: Confusion matrix (binary). Rows = actual class, columns = predicted class. Spam is the positive class here.

What you will learn

Build and read a confusion matrix from model predictions.
Derive accuracy, precision, recall, specificity, and F1 from matrix counts.
Tune a decision threshold and understand the precision–recall trade-off.
Know when ROC/PR curves help (preview for imbalanced data).
Extend ideas to multi-class problems and sklearn’s classification_report.
Pick what to log in the spam project test evaluation.

Before this lesson

Lesson 4 — Train / validation / test

The confusion matrix — your evaluation dashboard

For binary classification, pick one class as positive (spam). The other is negative (ham).

	Predicted positive (spam)	Predicted negative (ham)
Actual positive (spam)	True positive (TP)	False negative (FN)
Actual negative (ham)	False positive (FP)	True negative (TN)

How to build it from predictions

Run the model on a labeled set (usually test — only once at the end).
Compare y_true vs y_pred row by row.
For each row, increment the one counter that matches: TP, FP, FN, or TN.

python

# y_true, y_pred: lists of 0/1 where 1 = spam
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

Reading the matrix

Cell	Meaning	User pain (spam example)
TP	Correctly caught spam	Inbox stays clean
FP	Ham called spam	Important email hidden in spam folder
FN	Spam called ham	Spam in inbox
TN	Correctly kept ham	Normal mail untouched

Rows sum to actual class counts. Columns sum to predicted counts. If your row totals do not match your dataset label counts, something is wrong with the split or encoding.
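The row and column totals described above can be checked in a few lines. This is a minimal sketch on a made-up ten-email sample; the toy labels and variable names are illustrative assumptions, not project data.

```python
# Toy labels (illustrative only): 1 = spam, 0 = ham
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

# Rows = actual class, so row totals must equal the label counts in y_true
assert tp + fn == y_true.count(1), "spam row does not match label count"
assert fp + tn == y_true.count(0), "ham row does not match label count"

# Columns = predicted class, so column totals must equal the prediction counts
assert tp + fp == y_pred.count(1)
assert fn + tn == y_pred.count(0)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

If one of the row assertions fails on real data, check the label encoding (which value means spam) before suspecting the model.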

Raw counts vs normalized matrix

View	When to use
Counts (TP=6, FP=2, …)	Debugging, small datasets, exact error counts
Row-normalized (each row sums to 100%)	“Of all real spam, what % did we catch?” → recall per row
Column-normalized	“Of all spam predictions, what % were correct?” → precision per column

For reports, show counts + derived metrics. Normalized heatmaps are great in dashboards.
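To make the two normalized views concrete, here is a small sketch in plain Python. It assumes the counts TP=6, FN=4, FP=2, TN=88 from this lesson's worked example and lays them out with rows = actual and columns = predicted; the variable names are illustrative.

```python
# Rows = actual class, columns = predicted class, both in the order [spam, ham]
counts = [
    [6, 4],   # actual spam: TP, FN
    [2, 88],  # actual ham:  FP, TN
]

# Row-normalized: divide each cell by its row total (each row sums to 1)
row_norm = [[cell / sum(row) for cell in row] for row in counts]

# Column-normalized: divide each cell by its column total (each column sums to 1)
col_totals = [sum(col) for col in zip(*counts)]
col_norm = [[cell / col_totals[j] for j, cell in enumerate(row)] for row in counts]

# The top-left cell is spam recall in one view and spam precision in the other
print(f"row-normalized top-left (recall):       {row_norm[0][0]:.2f}")
print(f"column-normalized top-left (precision): {col_norm[0][0]:.2f}")
```

The sketch assumes every row and column has at least one email; an empty class would need a guard against division by zero.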

Metrics derived from the matrix

All binary metrics below assume spam = positive class.

Accuracy

$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Easy to interpret when classes are balanced and all errors cost the same.

When it lies: suppose 99% of emails are ham. A model that always predicts ham scores 99% accuracy, yet TP = 0, so it catches no spam at all.
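The always-ham failure mode is easy to reproduce. The sketch below uses a hypothetical inbox of 1,000 emails with 10 spam; the numbers are made up to mirror the 99% ham scenario.

```python
# Hypothetical inbox: 10 spam and 990 ham (1 = spam, 0 = ham)
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)  # a "model" that always predicts ham

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy = {accuracy:.2%}, spam caught (TP) = {tp}")
```

Accuracy rewards the 990 correct ham predictions and hides the fact that every spam email was missed.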

Precision (positive predictive value)

$\text{precision} = \frac{TP}{TP + FP}$

Question: Of everything we flagged as spam, how much was actually spam?

High precision → fewer false alarms → fewer good emails lost.
Product goal: “Do not hide my important mail.”

Recall (sensitivity, true positive rate)

$\text{recall} = \frac{TP}{TP + FN}$

Question: Of all real spam, how much did we catch?

High recall → cleaner inbox.
Product goal: “Keep spam out of my inbox.”

Specificity (true negative rate)

$\text{specificity} = \frac{TN}{TN + FP}$

Question: Of all real ham, how much did we correctly leave alone?

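Useful when protecting the majority class matters (ham must not be disturbed). Precision and specificity both punish FP, but with different denominators: precision divides by everything predicted positive (TP + FP), while specificity divides by everything actually negative (TN + FP).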

False positive rate

$\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{specificity}$

Common in ROC curves (plot TPR vs FPR at many thresholds).

Worked numeric example

100 emails: 10 spam, 90 ham. Model predicts 8 as spam (6 truly spam, 2 ham wrong). Misses 4 spam (predicted ham).

Count	Value
TP	6
FP	2
FN	4
TN	88

Metric	Formula	Result
Accuracy	(6+88)/100	94%
Precision	6/(6+2)	75% — 1 in 4 spam flags wrong
Recall	6/(6+4)	60% — 40% of spam slipped through
Specificity	88/(88+2)	97.8% — ham mostly safe
F1	see below	66.7%

Insight: 94% accuracy still allows painful spam and false flags. Always report the matrix (or TP/FP/FN/TN) plus precision and recall.
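The table can be reproduced directly from the four counts. This is a minimal sketch; the `safe_div` helper and its choice to return 0.0 on an empty denominator are assumptions made for the example, not a fixed convention.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when den is zero (a convention chosen for this sketch)."""
    return num / den if den else 0.0

# Counts from the worked example above
tp, fp, fn, tn = 6, 2, 4, 88

accuracy = safe_div(tp + tn, tp + fp + fn + tn)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
specificity = safe_div(tn, tn + fp)
fpr = safe_div(fp, fp + tn)
f1 = safe_div(2 * precision * recall, precision + recall)

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "specificity": specificity,
    "FPR": fpr,
    "F1": f1,
}
for name, value in metrics.items():
    print(f"{name:<12}{value:.1%}")
```

Note that FPR and specificity always add up to 1, since they share the denominator FP + TN.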

F1 score — one number for balance

When you need a single score and both false positives and false negatives matter:

$\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

For the example above: F1 = 2 × (0.75 × 0.60) / (0.75 + 0.60) ≈ 0.667.

F1	Interpretation
High	Both precision and recall are decent
Low while accuracy looks high	Class imbalance or many easy negatives masking failure on positives

F-beta generalizes F1: F₂ weights recall higher (catch spam); F₀.₅ weights precision higher (protect ham). These variants are worth knowing for product conversations; F1 remains the default balance.
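For reference, the standard F-beta formula is (1 + β²) · precision · recall / (β² · precision + recall). The sketch below applies it to the worked example's precision (0.75) and recall (0.60); the function name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall, beta < 1 favors precision, beta = 1 gives F1.
    Returns 0.0 when both inputs are zero (a convention chosen for this sketch).
    """
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

p, r = 0.75, 0.60  # from the worked example
for beta in (0.5, 1, 2):
    print(f"F{beta}: {f_beta(p, r, beta):.3f}")
```

Because precision is the stronger of the two numbers here, the precision-leaning F₀.₅ comes out highest and the recall-leaning F₂ lowest.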

Decision threshold — metrics are not fixed

Most classifiers output a probability or score, not a hard label. You choose a threshold:

python

# scores: P(spam) from logistic regression
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 is default, not sacred

Figure: Precision vs recall as the threshold moves. Lowering the threshold raises recall and often lowers precision; raising it does the opposite.

Threshold	Typical effect
Lower (e.g. 0.3)	Predict more spam → recall ↑, precision ↓
Higher (e.g. 0.8)	Predict less spam → precision ↑, recall ↓

Tune on validation, not test. The spam project: try a few thresholds on val, pick one that matches product goals, then report final numbers on test once.
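A threshold sweep on the validation set might look like the sketch below. The labels and scores are hypothetical stand-ins for real validation outputs, and `precision_recall_at` is an illustrative helper rather than a library function.

```python
# Hypothetical validation data: true labels (1 = spam) and model scores P(spam)
val_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
val_scores = [0.95, 0.85, 0.60, 0.35, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]

def precision_recall_at(threshold, y_true, scores):
    """Apply a threshold to the scores and return (precision, recall) for the spam class."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(threshold, val_true, val_scores)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, recall falls from 1.00 to 0.50 as the threshold rises from 0.3 to 0.8, while precision climbs to 1.00. Whichever threshold you pick this way is then frozen before the single test-set run.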

ROC curve & AUC (preview)

For many thresholds, plot TPR (recall) vs FPR:

ROC curve — how well the model ranks positives above negatives across thresholds.
AUC (area under curve) — 1.0 = perfect ranking, 0.5 = random.
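The ranking interpretation can be made concrete: AUC equals the fraction of (spam, ham) pairs in which the spam email gets the higher score. The optional sketch below computes it that way in plain Python; the data is hypothetical, `roc_auc_by_ranking` is an illustrative name, and the pairwise loop is only practical for small sets.

```python
def roc_auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = spam) and model scores P(spam)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.35, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]

print(f"AUC = {roc_auc_by_ranking(y_true, scores):.3f}")
```

Here 21 of the 24 spam/ham pairs are ordered correctly, so the AUC is 0.875. No threshold appears anywhere in the computation, which is why AUC describes ranking quality rather than any one operating point.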

When ROC helps: balanced-ish problems, and comparing models independently of any single threshold.

Caveat on imbalanced spam data: ROC can look optimistic when negatives dominate. Also check a precision–recall (PR) curve, which focuses on the positive class.

You do not need to implement ROC in Module 2 — but know it exists when stakeholders ask “what’s the AUC?”

Multi-class confusion matrix

With 3+ classes (e.g. cat, dog, bird), the matrix is K × K:

Rows = actual class
Columns = predicted class
Diagonal = correct predictions
Off-diagonal = confusions (cat called dog, etc.)

Per-class precision/recall/F1 are computed by treating one class as positive and all others as negative (one-vs-rest).
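One-vs-rest bookkeeping on a K × K matrix can be sketched as follows. The 3 × 3 counts are invented for illustration, and `per_class_metrics` is a hypothetical helper, not a library call.

```python
labels = ["cat", "dog", "bird"]

# Hypothetical counts: rows = actual class, columns = predicted class
matrix = [
    [8, 2, 0],  # actual cat
    [1, 7, 2],  # actual dog
    [0, 1, 4],  # actual bird
]

def per_class_metrics(matrix, k):
    """Treat class k as positive and every other class as negative."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # rest of row k: actual k, predicted something else
    fp = sum(row[k] for row in matrix) - tp  # rest of column k: predicted k, actually something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for k, name in enumerate(labels):
    p, r, f1 = per_class_metrics(matrix, k)
    print(f"{name:<5} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={sum(matrix[k])}")
```

True negatives for class k are everything outside row k and column k; they are not needed for precision, recall, or F1.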

Macro vs weighted averages

Average	How	When
Macro	Mean of per-class F1 (equal weight per class)	Rare classes matter as much as common ones
Weighted	Mean weighted by class support	Reflects overall performance on the full dataset
Micro	Pool all TP/FP/FN globally	Equals accuracy when every sample has exactly one label (ordinary multi-class)
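The three averages in the table can be compared on one small matrix. The sketch below uses invented cat/dog/bird counts and the identity F1 = 2·TP / (2·TP + FP + FN); all names are illustrative.

```python
labels = ["cat", "dog", "bird"]

# Hypothetical counts: rows = actual class, columns = predicted class
matrix = [
    [8, 2, 0],  # actual cat
    [1, 7, 2],  # actual dog
    [0, 1, 4],  # actual bird
]

def f1_from_counts(tp, fp, fn):
    """F1 written directly in counts; equals the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

n = len(labels)
tps = [matrix[k][k] for k in range(n)]
fns = [sum(matrix[k]) - tps[k] for k in range(n)]
fps = [sum(row[k] for row in matrix) - tps[k] for k in range(n)]
supports = [sum(row) for row in matrix]  # true-label count per class

per_class_f1 = [f1_from_counts(tp, fp, fn) for tp, fp, fn in zip(tps, fps, fns)]

macro_f1 = sum(per_class_f1) / n                                                   # equal weight per class
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, supports)) / sum(supports)   # weight by support
micro_f1 = f1_from_counts(sum(tps), sum(fps), sum(fns))                            # pool counts first
accuracy = sum(tps) / sum(supports)

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f} micro={micro_f1:.3f} accuracy={accuracy:.3f}")
```

Micro F1 and accuracy coincide here because every sample has exactly one true label and one predicted label, so each error counts once as a false positive and once as a false negative.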

sklearn `classification_report` (spam project)

After training, this is the fastest sanity check on test:

python

from sklearn.metrics import classification_report, confusion_matrix
 
print(confusion_matrix(y_test, y_pred))  # raw counts
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))

Example output shape:

text

              precision    recall  f1-score   support

         ham       0.96      0.98      0.97        90
        spam       0.75      0.60      0.67        10

    accuracy                           0.94       100
   macro avg       0.85      0.79      0.82       100
weighted avg       0.94      0.94      0.94       100

Column	Meaning
precision	Per class — of predicted X, how many were X
recall	Per class — of actual X, how many were caught
f1-score	Harmonic mean per class
support	Count of true labels — always read this

What to report in the spam project

Minimum test-set deliverables:

Confusion matrix (counts or labeled heatmap)
Precision, recall, F1 for spam class
Support (how many spam vs ham in test)
One sentence on threshold choice from validation

Optional stretch: plot a PR curve, or try class_weight='balanced' in logistic regression and compare recall.

Which metric when? (cheat sheet)

Situation	Lead metric	Also watch
Balanced classes, equal error cost	Accuracy	Confusion matrix
Imbalanced (spam, fraud)	Precision and recall (or F1)	Support, matrix
Missing positives is costly (fraud, disease)	Recall	FPR, precision
False alarms are costly (ham → spam)	Precision	Specificity
Compare models before picking threshold	ROC-AUC or PR-AUC	Curves at multiple thresholds
Multi-class	Per-class F1 + macro or weighted	Full matrix

When is accuracy a bad sole metric?

Imbalanced classes — majority-class classifier wins accuracy while failing the minority.
Unequal error costs — missing fraud hurts more than a false alert.
Safety-critical domains — false negatives can be catastrophic.

Use the confusion matrix first, then precision/recall/F1 (and curves later in your career).

Checkpoint

A model has precision 99% and recall 10% for spam. Describe the user experience.
FP increases while TP and FN stay the same. Which metrics move down: precision, recall, both, or neither?
Why report support alongside F1?
You tune threshold on the test set to maximize F1. What rule did you break?

Answers

Almost every spam flag is correct (rare false positives), but most spam still arrives in the inbox — low recall.
Precision drops (more predicted positives are wrong). Recall is unchanged if TP and FN stay the same.
F1 on 10 spam examples vs 10,000 spam examples is not comparable — support shows class counts.
Test set is for final reporting only — threshold tuning belongs on validation to avoid optimistic bias.

What's next

Module 2 quiz — test your understanding, then the spam classifier project where you will plot the confusion matrix for real.
