Train, validation, and test splits

Before we begin

You need unseen data to know if a model generalizes. But you also need data to make decisions while building (which learning rate? which algorithm?). That is why we use three splits:

Train — learn weights.
Validation — tune choices without touching the final exam.
Test — one honest score at the end.

Figure: Three buckets (train, validation, test)

Typical split: ~70% train, ~15% validation, ~15% test (adjust for dataset size).

What you will learn

State the purpose of train, validation, and test sets.
Explain why validation data is necessary.
Recognize data leakage and avoid it.

Before this lesson

Lesson 3 — Overfitting & underfitting

Train set — where learning happens

The model updates weights using train examples only (in standard supervised learning).

Typical size: 60–80% of data for medium datasets.

What you do here: fit models, compute training loss, run gradient descent.

What you must not do: report final performance from train accuracy alone. Training scores are almost always optimistic because the model has already seen those examples.

Validation set — development mirror

Used during project iteration:

Compare logistic regression vs Naive Bayes.
Pick learning rate or regularization strength.
Decide when to stop training early.

Typical size: 10–20%.

Why separate from test? Every time you peek at test and change something, test stops being an unbiased “future.” Validation absorbs those choices.

Analogy: Practice exams with answer keys. You learn from them — they are not the final certification exam.
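As one illustration of "validation absorbs those choices," the sketch below picks a spam decision threshold using validation data only. The scores, labels, candidate thresholds, and helper names (accuracy_at_threshold, pick_threshold) are made up for illustration; accuracy is used only to keep the example short.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Fraction of correct predictions when 'spam' means score >= threshold."""
    correct = sum(
        1 for s, y in zip(scores, labels) if (1 if s >= threshold else 0) == y
    )
    return correct / len(labels)


def pick_threshold(val_scores, val_labels, candidates):
    """Return the candidate threshold with the best validation accuracy."""
    return max(
        candidates,
        key=lambda t: accuracy_at_threshold(val_scores, val_labels, t),
    )


# Hypothetical model scores and true labels (1 = spam) for six validation emails.
val_scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
val_labels = [1, 1, 0, 1, 0, 0]

best = pick_threshold(val_scores, val_labels, candidates=[0.3, 0.5, 0.7])
print("threshold chosen on validation:", best)
```

The test set plays no part in this choice. It would be scored once, afterwards, with the threshold already fixed.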

Test set — final exam

Touch once (ideally) when development is done.

Reports: “We expect ~X% precision on similar future data.”

Typical size: 10–20%.

If you repeatedly tune on test, you overfit the test set: scores look great in slides and then fail in production.

Why we need validation (not just train + test)

Without validation, developers either:

Tune on test → inflated claims, or
Tune on train → overfitting to training quirks.

Validation is the safe sandbox for comparisons.

Worked example — split 1,000 emails

Split	Count	Use
Train	700	Fit spam classifier weights
Validation	150	Pick threshold / compare models
Test	150	Final precision/recall report
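The counts in the table can be reproduced with a shuffle and two cuts. This is a minimal sketch using only the standard library; the integer IDs stand in for real emails, and three_way_split is a hypothetical helper name.

```python
import random


def three_way_split(items, n_train, n_val, seed=0):
    """Shuffle a copy of items and cut it into train, validation, and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Stand-in for 1,000 emails: each "email" is just an integer ID here.
email_ids = range(1000)
train_ids, val_ids, test_ids = three_way_split(email_ids, n_train=700, n_val=150)

print(len(train_ids), len(val_ids), len(test_ids))  # 700 150 150
```

Fixing the seed means the same emails land in the same bucket every time, which keeps the test set locked across reruns.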

Rule: test emails never appear in train. The same goes for duplicates and near-copies, so deduplicate before splitting.
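One simple way to deduplicate before splitting is to compare a normalized form of each email. The sketch below treats emails as near-copies when they match after lowercasing and collapsing whitespace. That rule is an assumption chosen for illustration; real near-duplicates such as templated spam usually need fuzzier matching.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe(emails):
    """Keep the first email for each normalized form, preserving order."""
    seen = set()
    unique = []
    for email in emails:
        key = normalize(email)
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique


# Made-up example emails: two pairs differ only in case, spacing, or not at all.
emails = [
    "WIN a FREE prize now",
    "win a free   prize now ",
    "Meeting moved to 3pm",
    "Meeting moved to 3pm",
    "Quarterly report attached",
]
unique_emails = dedupe(emails)
print(len(emails), "->", len(unique_emails))  # 5 -> 3
```

Running dedupe before the split guarantees that no copy of a training email can end up in validation or test.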

Data leakage — silent score inflation

Leakage = information that will not be available at prediction time, including anything from validation or test, influences training.

Examples:

Normalizing using global mean/std computed on all data before splitting.
Putting duplicate emails in both train and test.
Using future data to predict the past in time-series.
Building features from labels accidentally.

Fix: split first (or split by time/user), then compute statistics on train only, apply to val/test.
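The "statistics on train only" part of the fix can be sketched with the standard library. The feature (email length in characters), the values, and the helper names fit_scaler and transform are all hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Compute mean and standard deviation from training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    return mu, (sigma if sigma > 0 else 1.0)  # avoid dividing by zero


def transform(values, mu, sigma):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature: email length, already split into three buckets.
train_lengths = [120.0, 80.0, 100.0, 140.0, 60.0]
val_lengths = [90.0, 400.0]
test_lengths = [110.0, 30.0]

mu, sigma = fit_scaler(train_lengths)           # statistics come from train only
train_scaled = transform(train_lengths, mu, sigma)
val_scaled = transform(val_lengths, mu, sigma)   # reuse the train statistics
test_scaled = transform(test_lengths, mu, sigma)
```

scikit-learn's StandardScaler follows the same pattern: call fit on the training data, then transform on validation and test.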

Checkpoint: You merge train+test to “get more training data,” then split again randomly. What broke?

Answer sketch

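The random re-split scatters old train and old test examples across both new buckets. Anything fitted or chosen before the merge has already seen examples that now sit in the new test set, and whatever you learned by peeking at the old test during exploration or tuning carries over too. The new test score is no longer an unbiased estimate. Split once at the start and lock test away.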

Stratified splits (binary classification)

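When spam is only 5% of the data, a purely random split can leave validation with very few spam examples, or none at all in a small dataset. Use stratified splitting to keep class ratios similar in each bucket (for example, stratify=y in scikit-learn's train_test_split).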
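A stratified 700/150/150 split can be sketched with two calls to scikit-learn's train_test_split. The toy labels (5% spam) and placeholder feature rows are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data: 1,000 emails, 5% spam (label 1). X holds placeholder feature rows.
y = [1] * 50 + [0] * 950
X = [[i] for i in range(1000)]

# First cut: hold out 300 emails, preserving the spam ratio.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=300, stratify=y, random_state=0
)
# Second cut: split the held-out 300 into 150 validation and 150 test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=150, stratify=y_hold, random_state=0
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), round(sum(labels) / len(labels), 3))
```

Each bucket ends up with roughly 5% spam. The 15 held-out spam emails cannot divide evenly, so validation and test differ by one.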

Common mistakes

“We have no validation set — we use test for tuning.”
Shuffling time-ordered fraud data without time-based splits.
Reporting test score after 20 rounds of changes based on that same test.
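For the time-ordered case in the list above, a time-based split sorts by timestamp and cuts, so every training record is older than every validation record, which in turn is older than every test record. The record fields ("when", "amount") and the helper name time_based_split are hypothetical.

```python
from datetime import date


def time_based_split(records, train_frac=0.7, val_frac=0.15):
    """Sort by timestamp, then cut so train < validation < test in time."""
    ordered = sorted(records, key=lambda r: r["when"])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Twenty made-up daily transactions, deliberately given in reverse order.
records = [{"when": date(2024, 1, 20 - i), "amount": 10 * i} for i in range(20)]

train, val, test = time_based_split(records)
print(len(train), len(val), len(test))  # 14 3 3
print(train[-1]["when"], "<", val[0]["when"], "<", test[0]["when"])
```

No shuffling happens anywhere, so the model is never trained on data from after the period it is evaluated on.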

What's next

Lesson 5 — Metrics: accuracy, precision, recall — numbers that match business goals.