Project: spam classifier with API and MongoDB
Before we begin
This is a full-stack ML mini-project: train a classifier in Python, measure it properly, then serve predictions through Node.js and log every result in MongoDB.
Figure: End-to-end pipeline
How this connects to Module 2
| Lesson | Where you use it |
|---|---|
| Supervised learning | Labeled ham / spam emails |
| Classification | Predict discrete label, not a number |
| Train / val / test | Three splits — never tune on test |
| Precision & recall | Confusion matrix on held-out test set |
| Overfitting | Compare val F1 vs test F1 |
What you will build
| Piece | Tech | Purpose |
|---|---|---|
| Trainer | Python + scikit-learn | Learn spam vs ham from labeled messages |
| Metrics | matplotlib + sklearn | Confusion matrix, precision, recall |
| Inference | Flask (Python) | Load model, return label + probability |
| API | Node.js + Express | POST /classify for clients |
| Database | MongoDB | Log text, label, score, timestamp |
Estimated time: 3–5 hours.
Before you start
- Finish Module 2 quiz.
pip install scikit-learn pandas numpy matplotlib joblib flask
- Node.js 18+ and MongoDB (local or Atlas free tier).
spam-classifier/
  data/
    messages.csv
    train.csv / val.csv / test.csv
  python/
    prepare_data.py
    train.py
    serve.py
  server/
    model/spam_model.joblib
    index.js
    package.json
  reports/
    confusion_matrix.png
Step 1 — Labeled data + splits
Goal: CSV with label and text columns, split before any training.
Download UCI SMS Spam Collection (or similar). Save as data/messages.csv:
label,text
ham,"Hey are we still meeting today?"
spam,"WINNER!! Claim your prize now click here"
# python/prepare_data.py
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path
root = Path(__file__).resolve().parent.parent / "data"
df = pd.read_csv(root / "messages.csv")
# stratify=label keeps spam ratio similar in each split
train, temp = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
val, test = train_test_split(temp, test_size=0.5, stratify=temp["label"], random_state=42)
train.to_csv(root / "train.csv", index=False)
val.to_csv(root / "val.csv", index=False)
test.to_csv(root / "test.csv", index=False)
print("train", len(train), "val", len(val), "test", len(test))
Why stratify? If spam is 10% of the data, each split should stay at roughly 10% spam. Without it, validation might have zero spam rows, and the metrics become meaningless.
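The effect of `stratify` can be checked on a small synthetic table. The sketch below assumes a made-up 200-row DataFrame with 10% spam (not the real dataset) and repeats the same two-step 70/15/15 split; `spam_share` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for messages.csv: 200 rows, 10% spam (made-up data).
df = pd.DataFrame({
    "label": ["spam"] * 20 + ["ham"] * 180,
    "text": [f"message {i}" for i in range(200)],
})


def spam_share(frame: pd.DataFrame) -> float:
    """Fraction of rows labeled spam (hypothetical helper)."""
    return float((frame["label"] == "spam").mean())


# Same two-step split as prepare_data.py: 70% train, then 15% val / 15% test.
train, temp = train_test_split(
    df, test_size=0.3, stratify=df["label"], random_state=42
)
val, test = train_test_split(
    temp, test_size=0.5, stratify=temp["label"], random_state=42
)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(spam_share(part), 3))
```

Running the same check on the real train.csv / val.csv / test.csv is a quick way to confirm the split before training.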
| Split | Use |
|---|---|
| Train | Fit vectorizer + classifier weights |
| Validation | Tune threshold / compare models |
| Test | Once at the end for honest F1 |
Step 2 — Train with TF-IDF + logistic regression
Goal: Turn raw text into numbers, then learn a linear classifier.
# python/train.py
import joblib
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
root = Path(__file__).resolve().parent.parent
data = root / "data"
reports = root / "reports"
reports.mkdir(exist_ok=True)
(root / "server" / "model").mkdir(parents=True, exist_ok=True)
train = pd.read_csv(data / "train.csv")
val = pd.read_csv(data / "val.csv")
test = pd.read_csv(data / "test.csv")
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train["text"], train["label"])
What each piece does:
| Component | Role |
|---|---|
| TfidfVectorizer | Builds vocabulary from training text only; converts each message to a sparse vector of word weights |
| ngram_range=(1,2) | Uses single words + pairs ("free money") to catch spam phrases |
| LogisticRegression | Linear classifier; outputs ham/spam despite the name "regression" |
| Pipeline | Runs TF-IDF then classifier in one object, so the same transforms apply at train and predict time |
Evaluate on val and test:
for name, split in [("validation", val), ("test", test)]:
    pred = pipe.predict(split["text"])
    print(f"--- {name} ---")
    print(confusion_matrix(split["label"], pred, labels=["ham", "spam"]))
    print(classification_report(split["label"], pred))
# Plot confusion matrix on TEST only for deliverable
pred_test = pipe.predict(test["text"])
disp = ConfusionMatrixDisplay(confusion_matrix(test["label"], pred_test, labels=["ham", "spam"]), display_labels=["ham", "spam"])
disp.plot()
plt.savefig(reports / "confusion_matrix.png", dpi=120, bbox_inches="tight")
joblib.dump(pipe, root / "server" / "model" / "spam_model.joblib")
print("Saved model")
Alternative: MultinomialNB() is a strong baseline on word counts.
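A minimal sketch of that alternative, assuming the same pipeline shape with only the classifier swapped. The four training messages are made up for illustration; `nb_pipe` is a hypothetical name. MultinomialNB is designed for counts, though it generally also accepts TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set, only to show the pipeline shape.
texts = [
    "win a free prize now",
    "claim your free cash prize",
    "are we meeting today",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

nb_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),  # swapped in for LogisticRegression
])
nb_pipe.fit(texts, labels)

print(nb_pipe.predict(["free prize inside"]))
```

If you try both classifiers, compare them on the validation split and keep the test split for the final report.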
Step 3 — Read your metrics
Figure: Confusion matrix reminder
| Metric | Spam-filter question |
|---|---|
| Precision | When we flag spam, how often are we right? |
| Recall | Of all real spam, how much did we catch? |
Low recall → spam reaches inbox. Low precision → real mail in spam folder.
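As a worked example, the sketch below computes both metrics by hand from a confusion matrix and cross-checks them against scikit-learn. The ten labels are made up for illustration and are not results from the SMS dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for ten messages (not real results).
y_true = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

# Rows are true labels, columns are predictions, both in the order ham, spam.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the real spam, how much was caught

print(cm)
print("precision", precision, "recall", recall)

# Cross-check against the library functions, with spam as the positive class.
sk_precision = precision_score(y_true, y_pred, pos_label="spam")
sk_recall = recall_score(y_true, y_pred, pos_label="spam")
```

Here 3 of the 5 flagged messages are really spam (precision 0.6), and 3 of the 4 spam messages are caught (recall 0.75).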
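Threshold tuning (optional): pipe.predict_proba([text]) takes a list of messages and returns one row of class probabilities per message, with columns ordered as in pipe.classes_. Lowering the spam threshold below the default 0.5 increases recall at the cost of precision. Choose the threshold on the validation set, not the test set.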
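A custom threshold can be applied with a few lines of NumPy. In this sketch `label_with_threshold` is a hypothetical helper and the scores are made up. With the trained pipeline, the spam column could be taken as `pipe.predict_proba(texts)[:, list(pipe.classes_).index("spam")]`.

```python
import numpy as np


def label_with_threshold(spam_proba, threshold=0.5):
    """Return 'spam' where the spam probability reaches the threshold, else 'ham'."""
    spam_proba = np.asarray(spam_proba, dtype=float)
    return np.where(spam_proba >= threshold, "spam", "ham")


# Made-up spam probabilities for four messages.
scores = [0.92, 0.55, 0.38, 0.10]

default_labels = label_with_threshold(scores, 0.5)
lenient_labels = label_with_threshold(scores, 0.3)

print(default_labels)
print(lenient_labels)
```

Lowering the threshold from 0.5 to 0.3 flags one more message as spam. Whether that extra flag is a caught spam or a lost ham is what the validation precision and recall will show.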
Step 4 — Python inference service (Flask)
Node will not run sklearn directly — call a small Python server:
# python/serve.py
from flask import Flask, request, jsonify
import joblib
from pathlib import Path
app = Flask(__name__)
pipe = joblib.load(Path(__file__).resolve().parent.parent / "server" / "model" / "spam_model.joblib")
@app.post("/predict")
def predict():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    body = request.get_json(silent=True)
    text = body.get("text", "") if isinstance(body, dict) else ""
    if not isinstance(text, str) or not text.strip():
        return jsonify({"error": "text required"}), 400
    proba = pipe.predict_proba([text])[0]
    classes = list(pipe.classes_)
    spam_idx = classes.index("spam")
    return jsonify({
        "label": str(pipe.predict([text])[0]),
        "spam_probability": float(proba[spam_idx]),
    })

if __name__ == "__main__":
    app.run(port=5001, debug=False)
Run: python python/serve.py and keep this terminal open while testing Node.
Step 5 — Node.js API + MongoDB
cd server && npm init -y && npm install express mongodb cors dotenv axios
Add "type": "module" to package.json.
// server/index.js
import express from "express";
import cors from "cors";
import axios from "axios";
import { MongoClient } from "mongodb";
import "dotenv/config";
const app = express();
app.use(cors());
app.use(express.json());
const mongo = new MongoClient(process.env.MONGODB_URI || "mongodb://127.0.0.1:27017");
const PY_URL = process.env.PY_PREDICT_URL || "http://127.0.0.1:5001/predict";
app.post("/classify", async (req, res) => {
  try {
    // Optional chaining guards against a missing or unparsed request body
    const text = String(req.body?.text || "").trim();
    if (!text) return res.status(400).json({ error: "text required" });
    // Forward the text to the Flask inference service
    const { data } = await axios.post(PY_URL, { text });
    const doc = {
      text: text.slice(0, 500), // truncate for privacy in demos
      label: data.label,
      spamProbability: data.spam_probability,
      createdAt: new Date(),
    };
    await mongo.db("spam_lab").collection("predictions").insertOne(doc);
    res.json(doc);
  } catch (err) {
    console.error(err);
    res.status(502).json({ error: "prediction service unavailable" });
  }
});
app.get("/health", (_, res) => res.json({ ok: true }));
const port = process.env.PORT || 4000;
await mongo.connect();
app.listen(port, () => console.log(`API on :${port}`));
Request flow:
- Client sends POST /classify with { "text": "..." }.
- Node forwards text to Flask /predict.
- Python runs the text through the same saved Pipeline object → TF-IDF + logistic regression.
- Node stores result in MongoDB and returns JSON to client.
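From the client side, the first step of this flow can be sketched in Python with only the standard library. The helper names `build_classify_request` and `classify` are hypothetical, and the default base URL assumes the Node API is listening on port 4000 as configured above.

```python
import json
import urllib.request


def build_classify_request(text, base_url="http://localhost:4000"):
    """Build (but do not send) the POST /classify request."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/classify",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text, base_url="http://localhost:4000", timeout=5.0):
    """Send the request and return the parsed JSON reply (needs the API running)."""
    req = build_classify_request(text, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request works offline; classify() needs Node, Flask and MongoDB up.
req = build_classify_request("Free money click now!!!")
print(req.get_method(), req.full_url, req.data)
```

With all three services running, `classify("Free money click now!!!")` should return the stored document fields (text, label, spamProbability, createdAt) as a dict.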
Test:
curl -X POST http://localhost:4000/classify \
-H "Content-Type: application/json" \
  -d "{\"text\":\"Free money click now!!!\"}"
Step 6 — Meaningful experiments
Document in README:
- Test precision / recall / F1 on test set (not validation).
- Train on only 50 messages: what happens? (Too little data. Compare train F1 with test F1 to see whether the model underfits or overfits.)
- Val F1 much higher than test F1? (overfitting or lucky split)
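The training-size experiment can be scripted with a small helper. This is a sketch under stated assumptions: `spam_f1_at_size` is a hypothetical function, and the toy messages are made up so the block runs on its own. With the real data you would pass columns from train.csv and an evaluation split, ideally after shuffling the training rows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


def spam_f1_at_size(train_texts, train_labels, eval_texts, eval_labels, n):
    """Fit a fresh pipeline on the first n training rows; return spam F1 on the eval rows."""
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(train_texts[:n], train_labels[:n])
    pred = pipe.predict(eval_texts)
    return f1_score(eval_labels, pred, pos_label="spam", zero_division=0)


# Made-up toy messages standing in for the real splits (assumption, not course data).
toy_train_texts = [
    "win free cash now", "lunch at noon today",
    "free prize claim now", "see you after class",
    "cash prize winner click", "call me when home",
]
toy_train_labels = ["spam", "ham", "spam", "ham", "spam", "ham"]
toy_eval_texts = ["free cash prize", "see you at lunch"]
toy_eval_labels = ["spam", "ham"]

scores = {
    n: spam_f1_at_size(toy_train_texts, toy_train_labels,
                       toy_eval_texts, toy_eval_labels, n)
    for n in (2, 4, 6)
}
print(scores)
```

Recording F1 at a few sizes (for example 50, 500 and the full training set) gives the README a small learning curve instead of a single number.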
Troubleshooting
| Symptom | Fix |
|---|---|
| ECONNREFUSED on classify | Start Flask on port 5001 first |
| Validation has 0 spam | Re-run prepare_data.py with stratify= |
| Perfect train, bad test | Overfitting — reduce max_features or add data |
| Mongo connection failed | Start mongod or fix Atlas URI in .env |
Deliverables
- train.csv / val.csv / test.csv
- spam_model.joblib + confusion_matrix.png
- Test set precision & recall written in README
- POST /classify + MongoDB documents visible
- Short note: which metric you optimized for and why
What's next
Module 2 complete. Continue to Module 3 — Neural networks when ready.