Module 2 — Core machine learning

Project: spam classifier with API & MongoDB

Train logistic regression in Python, report confusion matrix metrics, expose a Node.js classify endpoint, and log predictions to MongoDB.

~180 min read + exercises

Before we begin

This is a full-stack ML mini-project: train a classifier in Python, measure it properly, then serve predictions through Node.js and log every result in MongoDB.

Figure: End-to-end pipeline

Train (Python) → Model (weights) → API (Node.js) → Store (MongoDB)
Train offline → export model → API predicts → MongoDB logs results.

How this connects to Module 2

Lesson | Where you use it
Supervised learning | Labeled ham / spam messages
Classification | Predict a discrete label, not a number
Train / val / test | Three splits; never tune on test
Precision & recall | Confusion matrix on held-out test set
Overfitting | Compare val F1 vs test F1

What you will build

Piece | Tech | Purpose
Trainer | Python + scikit-learn | Learn spam vs ham from labeled messages
Metrics | matplotlib + sklearn | Confusion matrix, precision, recall
Inference | Flask (Python) | Load model, return label + probability
API | Node.js + Express | POST /classify for clients
Database | MongoDB | Log text, label, score, timestamp

Estimated time: 3–5 hours.


Before you start

  • Finish Module 2 quiz.
  • pip install scikit-learn pandas numpy matplotlib joblib flask
  • Node.js 18+ and MongoDB (local or Atlas free tier).

Project layout:

text
spam-classifier/
  data/
    messages.csv
    train.csv / val.csv / test.csv
  python/
    prepare_data.py
    train.py
    serve.py
  server/
    model/spam_model.joblib
    index.js
    package.json
  reports/
    confusion_matrix.png

Step 1 — Labeled data + splits

Goal: CSV with label and text columns, split before any training.

Download the UCI SMS Spam Collection (or a similar labeled dataset) and save it as data/messages.csv with a header row and two columns:

csv
label,text
ham,"Hey are we still meeting today?"
spam,"WINNER!! Claim your prize now click here"
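The raw download may need converting before it matches this layout. The following is a minimal sketch, assuming the raw file is named SMSSpamCollection and holds one tab-separated label and message per line with no header row (check your copy before relying on this). convert_sms_collection is a hypothetical helper, not one of the project files.

```python
import csv
import io

VALID_LABELS = {"ham", "spam"}

def convert_sms_collection(raw_lines, out_file):
    """Write label,text CSV rows from 'label<TAB>text' lines; return the row count."""
    writer = csv.writer(out_file)
    writer.writerow(["label", "text"])  # header expected by prepare_data.py
    count = 0
    for line in raw_lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        label, sep, text = line.partition("\t")
        label = label.strip().lower()
        if not sep or label not in VALID_LABELS:
            raise ValueError(f"unexpected line: {line!r}")
        # csv.writer quotes commas and double quotes inside the message
        writer.writerow([label, text.strip()])
        count += 1
    return count

# Demo on two in-memory lines instead of the real download
raw = [
    "ham\tHey, are we still meeting today?\n",
    'spam\tWINNER!! Claim your "prize" now\n',
]
buffer = io.StringIO()
n_rows = convert_sms_collection(raw, buffer)
print(n_rows, "rows")
print(buffer.getvalue())
```

For the real file, open the source with encoding="utf-8" and the destination with newline="" (as the csv module documentation recommends), then pass both file objects to the function.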
python
# python/prepare_data.py
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path
 
root = Path(__file__).resolve().parent.parent / "data"
df = pd.read_csv(root / "messages.csv")
 
# stratify=label keeps spam ratio similar in each split
train, temp = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
val, test = train_test_split(temp, test_size=0.5, stratify=temp["label"], random_state=42)
 
train.to_csv(root / "train.csv", index=False)
val.to_csv(root / "val.csv", index=False)
test.to_csv(root / "test.csv", index=False)
print("train", len(train), "val", len(val), "test", len(test))

Why stratify? If spam is 10% of the data, each split should stay close to 10% spam. Without stratification, a small validation set can end up with very few spam rows, or none, and precision and recall measured on it become meaningless.
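To see the effect, the sketch below repeats the same 70/15/15 split on synthetic labels (1,000 rows, 10% spam; made-up data, not the real dataset) with and without stratification and prints the spam share of each split. three_way_split and spam_share are hypothetical helper names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 100 spam, 900 ham (illustration only)
labels = np.array(["spam"] * 100 + ["ham"] * 900)
idx = np.arange(len(labels))

def spam_share(y):
    """Fraction of rows labeled spam."""
    return float(np.mean(y == "spam"))

def three_way_split(idx, labels, stratified):
    """Return train/val/test index arrays using the same 70/15/15 scheme as prepare_data.py."""
    strat = labels if stratified else None
    train_i, temp_i = train_test_split(
        idx, test_size=0.3, stratify=strat, random_state=42
    )
    strat_temp = labels[temp_i] if stratified else None
    val_i, test_i = train_test_split(
        temp_i, test_size=0.5, stratify=strat_temp, random_state=42
    )
    return train_i, val_i, test_i

for stratified in (True, False):
    parts = three_way_split(idx, labels, stratified)
    shares = [round(spam_share(labels[p]), 3) for p in parts]
    print("stratified" if stratified else "random    ", shares)
```

With stratification every split keeps the 10% spam share; the purely random split drifts, and the drift grows as the dataset shrinks.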

Split | Use
Train | Fit vectorizer + classifier weights
Validation | Tune threshold / compare models
Test | Once at the end for honest F1

Step 2 — Train with TF-IDF + logistic regression

Goal: Turn raw text into numbers, then learn a linear classifier.

python
# python/train.py
import joblib
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
 
root = Path(__file__).resolve().parent.parent
data = root / "data"
reports = root / "reports"
reports.mkdir(exist_ok=True)
(root / "server" / "model").mkdir(parents=True, exist_ok=True)
 
train = pd.read_csv(data / "train.csv")
val = pd.read_csv(data / "val.csv")
test = pd.read_csv(data / "test.csv")
 
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
 
pipe.fit(train["text"], train["label"])

What each piece does:

Component | Role
TfidfVectorizer | Builds vocabulary from training text only; converts each message to a sparse vector of word weights
ngram_range=(1,2) | Uses single words + pairs ("free money"); catches spam phrases
LogisticRegression | Linear classifier; outputs ham/spam despite the name "regression"
Pipeline | Runs TF-IDF then classifier in one object, so the same transforms apply at train and predict time

Evaluate on val and test:

python
for name, split in [("validation", val), ("test", test)]:
    pred = pipe.predict(split["text"])
    print(f"--- {name} ---")
    print(confusion_matrix(split["label"], pred, labels=["ham", "spam"]))
    print(classification_report(split["label"], pred))
 
# Plot confusion matrix on TEST only for deliverable
pred_test = pipe.predict(test["text"])
disp = ConfusionMatrixDisplay(confusion_matrix(test["label"], pred_test, labels=["ham", "spam"]), display_labels=["ham", "spam"])
disp.plot()
plt.savefig(reports / "confusion_matrix.png", dpi=120, bbox_inches="tight")
 
joblib.dump(pipe, root / "server" / "model" / "spam_model.joblib")
print("Saved model")

Alternative: swap the "clf" step for MultinomialNB() from sklearn.naive_bayes, a strong baseline on word counts.
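One way to compare the two classifiers is to fit each inside the same TF-IDF pipeline and score spam F1 on the validation split only. The sketch below assumes that approach; the messages are toy data and make_pipeline / validation_f1 are hypothetical helper names. In the project you would pass train["text"], train["label"], val["text"], and val["label"] instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def make_pipeline(classifier):
    """Same TF-IDF settings as train.py, with a swappable final step."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ("clf", classifier),
    ])

def validation_f1(candidates, train_texts, train_labels, val_texts, val_labels):
    """Fit each candidate on train and return its spam F1 on validation."""
    scores = {}
    for name, clf in candidates.items():
        pipe = make_pipeline(clf)
        pipe.fit(train_texts, train_labels)
        pred = pipe.predict(val_texts)
        scores[name] = float(
            f1_score(val_labels, pred, pos_label="spam", zero_division=0)
        )
    return scores

# Toy data (illustration only)
train_texts = [
    "free prize claim now", "win cash free entry", "urgent claim your free cash",
    "are we meeting today", "see you at lunch", "call me when you are home",
    "thanks for the notes", "running late see you soon",
]
train_labels = ["spam"] * 3 + ["ham"] * 5
val_texts = ["claim your free prize", "lunch today", "free cash now", "see you at home"]
val_labels = ["spam", "ham", "spam", "ham"]

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
}
scores = validation_f1(candidates, train_texts, train_labels, val_texts, val_labels)
print(scores)
```

Pick the winner from these validation scores, then report test metrics once for that model only.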


Step 3 — Read your metrics

Figure: Confusion matrix reminder (spam = positive class)

Actual \ Predicted | spam | ham
spam | TP (spam→spam) | FN (spam→ham)
ham | FP (ham→spam) | TN (ham→ham)

FP = ham marked spam. FN = spam missed.

Metric | Spam-filter question
Precision | When we flag spam, how often are we right?
Recall | Of all real spam, how much did we catch?

Low recall → spam reaches inbox. Low precision → real mail in spam folder.
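As a worked example, the sketch below computes precision, recall, and F1 from a confusion matrix in the layout that confusion_matrix(..., labels=["ham", "spam"]) produces: rows are actual classes, columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. The counts are hypothetical, not results from this project.

```python
def spam_metrics(matrix):
    """matrix is [[TN, FP], [FN, TP]] (rows = actual, columns = predicted; order ham, spam)."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged messages, share that is spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real spam, share that was flagged
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 850 ham (10 wrongly flagged), 150 spam (30 missed)
example = [[840, 10], [30, 120]]
metrics = spam_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.923
# recall: 0.800
# f1: 0.857
```

Here the filter is rarely wrong when it flags a message (high precision) but still lets one in five spam messages through (recall 0.80).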

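Threshold tuning (optional): pipe.predict_proba(texts) takes a list of messages and returns one row of class probabilities per message, with columns ordered as in pipe.classes_. pipe.predict corresponds to a 0.5 cutoff; flagging spam at a lower spam probability increases recall at the cost of precision. Choose the threshold on the validation split, not on test.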
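A minimal sketch of a threshold sweep. The probabilities and labels below are made up; in the project they might come from the validation split, for example `pipe.predict_proba(val["text"])[:, list(pipe.classes_).index("spam")]`. precision_recall_at is a hypothetical helper.

```python
def precision_recall_at(threshold, spam_proba, y_true):
    """Flag a message as spam when its spam probability is >= threshold."""
    flagged = [p >= threshold for p in spam_proba]
    tp = sum(1 for f, y in zip(flagged, y_true) if f and y == "spam")
    fp = sum(1 for f, y in zip(flagged, y_true) if f and y == "ham")
    fn = sum(1 for f, y in zip(flagged, y_true) if not f and y == "spam")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical validation labels and spam probabilities
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "spam", "ham"]
spam_proba = [0.90, 0.60, 0.40, 0.20, 0.55, 0.10, 0.35, 0.05]

for threshold in (0.7, 0.5, 0.3):
    precision, recall = precision_recall_at(threshold, spam_proba, y_true)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
# threshold=0.7 precision=1.00 recall=0.25
# threshold=0.5 precision=0.67 recall=0.50
# threshold=0.3 precision=0.80 recall=1.00
```

On this toy data, lowering the threshold raises recall from 0.25 to 1.00. Precision usually falls as the threshold drops, although on small samples it can move either way, as it does here between 0.5 and 0.3.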


Step 4 — Python inference service (Flask)

Node.js cannot load the scikit-learn model directly, so it calls a small Python server:

python
# python/serve.py
from flask import Flask, request, jsonify
import joblib
from pathlib import Path
 
app = Flask(__name__)
pipe = joblib.load(Path(__file__).resolve().parent.parent / "server" / "model" / "spam_model.joblib")
 
@app.post("/predict")
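def predict():
    # silent=True returns None instead of raising when the body is not valid JSON
    payload = request.get_json(silent=True)
    text = payload.get("text", "") if isinstance(payload, dict) else ""
    if not isinstance(text, str) or not text.strip():
        return jsonify({"error": "text required"}), 400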
    proba = pipe.predict_proba([text])[0]
    classes = list(pipe.classes_)
    spam_idx = classes.index("spam")
    return jsonify({
        "label": pipe.predict([text])[0],
        "spam_probability": float(proba[spam_idx]),
    })
 
if __name__ == "__main__":
    app.run(port=5001, debug=False)

Run: python python/serve.py — keep this terminal open while testing Node.


Step 5 — Node.js API + MongoDB

bash
cd server && npm init -y && npm install express mongodb cors dotenv axios

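Add "type": "module" to package.json so Node treats index.js as an ES module (required for the import statements and the top-level await below).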

javascript
// server/index.js
import express from "express";
import cors from "cors";
import axios from "axios";
import { MongoClient } from "mongodb";
import "dotenv/config";
 
const app = express();
app.use(cors());
app.use(express.json());
 
const mongo = new MongoClient(process.env.MONGODB_URI || "mongodb://127.0.0.1:27017");
const PY_URL = process.env.PY_PREDICT_URL || "http://127.0.0.1:5001/predict";
 
app.post("/classify", async (req, res) => {
  try {
    // req.body can be undefined when no JSON body was parsed, so use optional chaining
    const text = String(req.body?.text || "").trim();
    if (!text) return res.status(400).json({ error: "text required" });
 
    const { data } = await axios.post(PY_URL, { text });
    const doc = {
      text: text.slice(0, 500),  // truncate for privacy in demos
      label: data.label,
      spamProbability: data.spam_probability,
      createdAt: new Date(),
    };
 
    await mongo.db("spam_lab").collection("predictions").insertOne(doc);
    res.json(doc);
  } catch (err) {
    // Any failure lands here, including a MongoDB insert error; check the log to see which
    console.error(err);
    res.status(502).json({ error: "prediction service unavailable" });
  }
});
 
app.get("/health", (_, res) => res.json({ ok: true }));
 
const port = process.env.PORT || 4000;
await mongo.connect();
app.listen(port, () => console.log(`API on :${port}`));

Request flow:

  1. Client sends POST /classify with { "text": "..." }.
  2. Node forwards the text to Flask /predict.
  3. Flask runs the text through the Pipeline saved by train.py (TF-IDF, then logistic regression) and returns the label and spam probability.
  4. Node stores the result in MongoDB and returns it as JSON to the client.

Test:

bash
curl -X POST http://localhost:4000/classify \
  -H "Content-Type: application/json" \
  -d "{\"text\":\"Free money click now!!!\"}"
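The same request can be sent from Python using only the standard library. This is a sketch that assumes the API is listening on http://localhost:4000 as configured above; build_classify_request and classify are hypothetical helper names. Only the request is built and printed here, since classify() needs Flask, Node, and MongoDB to be running.

```python
import json
import urllib.request

def build_classify_request(text, base_url="http://localhost:4000"):
    """Build (but do not send) the POST /classify request."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/classify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify(text, base_url="http://localhost:4000", timeout=5):
    """Send the request and return the parsed JSON response."""
    request = build_classify_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

req = build_classify_request("Free money click now!!!")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With all three services up, classify("Free money click now!!!") should return the same fields that the curl call prints.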

Step 6 — Meaningful experiments

Document in README:

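  • Precision / recall / F1 on the test set (not validation).
  • Train on only 50 messages: what happens to test recall and F1? (too little data to generalize; compare train and test scores to see how the model fails)
  • Val F1 much higher than test F1? (overfitting to the validation split through tuning, or a lucky split)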
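The training-size experiment can be scripted so each run is comparable. The sketch below assumes a hypothetical helper, f1_at_train_size, that draws a stratified subsample of n training messages, fits the same pipeline, and reports spam F1 on both the subsample and the test set. It runs on synthetic messages generated from small word lists, which are far easier than real SMS data, so the numbers it prints say nothing about the real classifier.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

SPAM_WORDS = ["free", "winner", "prize", "claim", "cash", "urgent"]
HAM_WORDS = ["meeting", "lunch", "home", "tomorrow", "call", "thanks"]
COMMON = ["you", "now", "please", "today", "the", "ok"]

def synthetic_messages(n, spam_rate=0.2, seed=0):
    """Generate toy messages; stands in for data/train.csv and data/test.csv."""
    rng = random.Random(seed)
    texts, labels = [], []
    for _ in range(n):
        is_spam = rng.random() < spam_rate
        vocab = (SPAM_WORDS if is_spam else HAM_WORDS) + COMMON
        texts.append(" ".join(rng.choice(vocab) for _ in range(6)))
        labels.append("spam" if is_spam else "ham")
    return texts, labels

def f1_at_train_size(n, train_texts, train_labels, test_texts, test_labels, seed=42):
    """Fit on a stratified subsample of n messages; return train and test spam F1."""
    small_texts, _, small_labels, _ = train_test_split(
        train_texts, train_labels, train_size=n,
        stratify=train_labels, random_state=seed,
    )
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(small_texts, small_labels)

    def score(texts, labels):
        pred = pipe.predict(texts)
        return float(f1_score(labels, pred, pos_label="spam", zero_division=0))

    return {
        "train_f1": score(small_texts, small_labels),
        "test_f1": score(test_texts, test_labels),
    }

texts, labels = synthetic_messages(600)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)
for n in (50, 200, 400):
    print(n, f1_at_train_size(n, train_texts, train_labels, test_texts, test_labels))
```

For the README, swap the synthetic data for the real train and test splits and record how the gap between train F1 and test F1 changes as n grows.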

Troubleshooting

Symptom | Fix
ECONNREFUSED on classify | Start Flask on port 5001 first
Validation has 0 spam | Re-run prepare_data.py with stratify=
Perfect train, bad test | Overfitting: reduce max_features or add data
Mongo connection failed | Start mongod or fix Atlas URI in .env

Deliverables

  • train.csv / val.csv / test.csv
  • spam_model.joblib + confusion_matrix.png
  • Test set precision & recall written in README
  • POST /classify + MongoDB documents visible
  • Short note: which metric you optimized for and why

What's next

Module 2 complete. Continue to Module 3 — Neural networks when ready.
