Supervised vs unsupervised learning
Before we begin
Every ML tutorial mentions supervised learning. Here is the core idea, one sentence for each type:
Supervised learning = learning from examples that include the correct answer.
Unsupervised learning = no correct answers provided — the algorithm finds structure on its own (clusters, patterns, anomalies).
Knowing which type you have shapes most of the project: data labeling cost, evaluation metrics, and overall design.
Figure: Labels vs no labels
What you will learn
- Define supervised and unsupervised learning in plain language.
- Name real examples of each.
- Recognize which type a new problem belongs to.
Before this lesson
- Module 2 welcome
- Module 1 welcome — key concepts (model, training, labels)
Supervised learning — learning with a teacher
Imagine flashcards:
- Front: email text
- Back: label says spam or not spam
The model sees thousands of fronts with known backs. Training adjusts weights so predictions match the backs.
Other supervised examples:
| Input | Label (correct answer) |
|---|---|
| House size, bedrooms | Price in dollars |
| Photo of a digit | Digit 0–9 |
| Review text | Positive or negative |
| Medical scan | Disease present yes/no |
What you need: a labeled dataset. Labels often come from humans, which costs time and money.
Training goal: minimize prediction error vs labels (same spirit as Module 1 gradient descent, often with different loss functions).
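To make the training goal concrete, here is a minimal sketch of supervised learning on the house-price row from the table. The four data points, the units, the learning rate, and the function names are all made up for illustration; this is not the course project's code. It fits price = w * size + b by repeatedly nudging w and b to shrink the gap between predictions and labels.

```python
# Toy labeled dataset (made-up numbers, for illustration only).
# Input: house size in thousands of square feet.
# Label: sale price in hundreds of thousands of dollars.
sizes = [1.0, 1.5, 2.0, 2.5]
prices = [2.0, 3.0, 4.0, 5.0]


def mean_squared_error(w, b, xs, ys):
    """Average squared gap between predictions and labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, steps=5000):
    """Fit price = w * size + b by plain gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(sizes, prices)
print(f"learned w={w:.3f}, b={b:.3f}")
print(f"error after training: {mean_squared_error(w, b, sizes, prices):.6f}")
```

The labels are what make this supervised: without the prices list there would be nothing to measure the error against, and the loop would have no direction to move in.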
Unsupervised learning — finding structure alone
No flashcard backs. You only get inputs.
Examples:
- Clustering — group customers by behavior without predefined segments.
- Topic modeling — discover themes in thousands of documents.
- Anomaly detection — flag transactions unlike typical ones.
- Dimensionality reduction — compress high-dimensional data for visualization.
The algorithm might output cluster IDs or scores — but nobody told it the “right” groups in advance.
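As a rough illustration of the clustering example, the sketch below groups customers by a single number (monthly spend) using a basic two-cluster k-means loop. The spend values, the function name kmeans_1d, and the choice to start the centers at the smallest and largest value are assumptions made for this example. Real projects would normally use a library implementation and more than one feature.

```python
def kmeans_1d(values, iterations=10):
    """Split numbers into two clusters with a basic k-means loop (k = 2)."""
    # Deterministic start for this sketch: smallest and largest value.
    centers = [min(values), max(values)]
    assignments = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return assignments, centers


# Made-up monthly spend per customer; note there are no labels anywhere.
monthly_spend = [10, 12, 11, 95, 100, 98]
cluster_ids, centers = kmeans_1d(monthly_spend)
print(cluster_ids)  # [0, 0, 0, 1, 1, 1]
print(centers)
```

The output is only a list of cluster IDs. Deciding that cluster 0 means "low spenders" is an interpretation you add afterwards; the algorithm never saw such a name.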
Semi-supervised (quick note)
Real projects sometimes have a few labels and many unlabeled examples. Semi-supervised techniques combine both. You will see a related idea with modern LLMs (pre-train on unlabeled text, then fine-tune on labeled examples). For now, know the name exists.
Worked example — classify the problem
| Problem | Supervised or unsupervised? | Why |
|---|---|---|
| Predict house price from listings with sold prices | Supervised | Sold price is the label |
| Group news articles by topic without tags | Unsupervised | No topic labels given |
| Detect spam with 10,000 labeled emails | Supervised | spam/ham labels exist |
| Find unusual login patterns without fraud labels | Often unsupervised (anomaly detection) | No labeled fraud needed upfront |
Checkpoint: You have 1M product reviews without star ratings. You want to discover common complaint themes. Supervised or unsupervised?
Answer sketch
Unsupervised, for example with topic modeling. You are discovering structure, not predicting a provided label.
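The rule of thumb behind the table and the checkpoint can be written as a small helper. This is a hypothetical function for illustration, not part of any library: it looks only at whether labels are present, with None standing in for "no label".

```python
def learning_type(labels):
    """Name the setup from a list of labels, where None means "no label"."""
    if not labels:
        raise ValueError("need at least one example")
    labeled = sum(label is not None for label in labels)
    if labeled == len(labels):
        return "supervised"
    if labeled == 0:
        return "unsupervised"
    return "semi-supervised"


print(learning_type(["spam", "ham", "spam"]))  # supervised
print(learning_type([None, None, None]))       # unsupervised
print(learning_type(["spam", None, None]))     # semi-supervised
```

Treat this as a first check only. The goal matters too: you can own labeled data and still choose clustering if the question is "what groups exist?" rather than "what is the label?".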
Common mistakes
- Calling a problem “supervised” when no labels exist.
- Using test labels during training (this leaks the answers and inflates your scores; covered in Lesson 4).
- Assuming unsupervised outputs are “true” clusters — they are useful views, not ground truth.
Why it matters for your project
The spam classifier is supervised: each training email has a spam or ham label. Module 2 project = classic supervised classification.
What's next
Lesson 2 — Regression vs classification — once you have labels, is the answer a number or a category?