Project: spam classifier with API and MongoDB

Before we begin

This is a full-stack ML mini-project: train a classifier in Python, measure it properly, then serve predictions through Node.js and log every result in MongoDB.

Figure: End-to-end pipeline

Train offline → export model → API predicts → MongoDB logs results.

How this connects to Module 2

Lesson	Where you use it
Supervised learning	Labeled `ham` / `spam` emails
Classification	Predict discrete label, not a number
Train / val / test	Three splits — never tune on test
Precision & recall	Confusion matrix on held-out test set
Overfitting	Compare val F1 vs test F1

What you will build

Piece	Tech	Purpose
Trainer	Python + scikit-learn	Learn spam vs ham from labeled messages
Metrics	matplotlib + sklearn	Confusion matrix, precision, recall
Inference	Flask (Python)	Load model, return label + probability
API	Node.js + Express	`POST /classify` for clients
Database	MongoDB	Log text, label, score, timestamp

Estimated time: 3–5 hours.

Before you start

Finish the Module 2 quiz.
Install the Python packages: `pip install scikit-learn pandas numpy matplotlib joblib flask`
Install Node.js 18+ and MongoDB (local or Atlas free tier).

text

spam-classifier/
  data/
    messages.csv
    train.csv / val.csv / test.csv
  python/
    prepare_data.py
    train.py
    serve.py
  server/
    model/spam_model.joblib
    index.js
    package.json
  reports/
    confusion_matrix.png

Step 1 — Labeled data + splits

Goal: CSV with label and text columns, split before any training.

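Download the UCI SMS Spam Collection (or a similar labeled dataset). The raw UCI file is tab-separated with no header row, so convert it and save it as data/messages.csv in this layout: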

csv

label,text
ham,"Hey are we still meeting today?"
spam,"WINNER!! Claim your prize now click here"
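A minimal sketch for producing this file from the raw UCI download, assuming the unzipped file is saved as data/SMSSpamCollection (tab-separated, label first, no header row). The script name, the `convert_raw` helper, and the raw file path are hypothetical; adjust them to your setup.

```python
# python/convert_raw.py (hypothetical helper, not part of the layout above)
import csv
from pathlib import Path


def convert_raw(raw_path, out_path):
    """Rewrite a tab-separated 'label<TAB>text' file as CSV with a header row."""
    rows = 0
    with open(raw_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)  # quotes any text that contains commas or quotes
        writer.writerow(["label", "text"])
        for line in src:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only; the message itself may contain tabs
            label, _, text = line.partition("\t")
            writer.writerow([label, text])
            rows += 1
    return rows


if __name__ == "__main__":
    # Run from the project root; the raw filename is an assumption
    raw = Path("data") / "SMSSpamCollection"
    if raw.exists():
        print("rows written:", convert_raw(raw, Path("data") / "messages.csv"))
    else:
        print("raw file not found:", raw)
```

Using `csv.writer` rather than manual string joins matters here because many messages contain commas and quotation marks.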

python

# python/prepare_data.py
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path
 
root = Path(__file__).resolve().parent.parent / "data"
df = pd.read_csv(root / "messages.csv")
 
# stratify=label keeps spam ratio similar in each split
train, temp = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
val, test = train_test_split(temp, test_size=0.5, stratify=temp["label"], random_state=42)
 
train.to_csv(root / "train.csv", index=False)
val.to_csv(root / "val.csv", index=False)
test.to_csv(root / "test.csv", index=False)
print("train", len(train), "val", len(val), "test", len(test))

Why stratify? If spam is 10% of the data, each split should stay close to 10% spam. Without stratification, a small validation split can end up with very few spam rows, or none at all, and precision and recall for spam become meaningless.
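A quick way to check this is to print the spam share of each split. The sketch below uses a made-up 100-row frame in place of messages.csv so it runs on its own; `spam_share` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def spam_share(frame):
    """Fraction of rows labeled 'spam' in a DataFrame with a 'label' column."""
    return float((frame["label"] == "spam").mean())


# Toy stand-in for messages.csv: 100 rows, 10% spam (made-up data)
df = pd.DataFrame({
    "label": ["spam"] * 10 + ["ham"] * 90,
    "text": [f"message {i}" for i in range(100)],
})

# Same two-stage split as prepare_data.py
train, temp = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
val, test = train_test_split(temp, test_size=0.5, stratify=temp["label"], random_state=42)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(f"{name}: {len(part)} rows, spam share {spam_share(part):.2f}")
```

On the real data, the same loop can be run against train.csv, val.csv, and test.csv after `prepare_data.py` finishes.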

Split	Use
Train	Fit vectorizer + classifier weights
Validation	Tune threshold / compare models
Test	Once at the end for honest F1

Step 2 — Train with TF-IDF + logistic regression

Goal: Turn raw text into numbers, then learn a linear classifier.

python

# python/train.py
import joblib
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
 
root = Path(__file__).resolve().parent.parent
data = root / "data"
reports = root / "reports"
reports.mkdir(exist_ok=True)
(root / "server" / "model").mkdir(parents=True, exist_ok=True)
 
train = pd.read_csv(data / "train.csv")
val = pd.read_csv(data / "val.csv")
test = pd.read_csv(data / "test.csv")
 
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
 
pipe.fit(train["text"], train["label"])

What each piece does:

Component	Role
`TfidfVectorizer`	Builds vocabulary from training text only; converts each message to a sparse vector of word weights
`ngram_range=(1,2)`	Uses single words + pairs (`"free money"`) — catches spam phrases
`LogisticRegression`	Linear classifier; outputs ham/spam despite the name "regression"
`Pipeline`	Runs TF-IDF then classifier in one object — same transforms at train and predict time

Evaluate on val and test:

python

for name, split in [("validation", val), ("test", test)]:
    pred = pipe.predict(split["text"])
    print(f"--- {name} ---")
    print(confusion_matrix(split["label"], pred, labels=["ham", "spam"]))
    print(classification_report(split["label"], pred))
 
# Plot confusion matrix on TEST only for deliverable
pred_test = pipe.predict(test["text"])
disp = ConfusionMatrixDisplay(confusion_matrix(test["label"], pred_test, labels=["ham", "spam"]), display_labels=["ham", "spam"])
disp.plot()
plt.savefig(reports / "confusion_matrix.png", dpi=120, bbox_inches="tight")
 
joblib.dump(pipe, root / "server" / "model" / "spam_model.joblib")
print("Saved model")

Alternative: swap LogisticRegression for MultinomialNB() (from sklearn.naive_bayes), a strong baseline on word-count features.
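A sketch of that swap, keeping the TF-IDF step from train.py unchanged. `make_nb_pipeline` is a hypothetical helper, and the four training messages are made up so the example runs without the CSV files. For true word counts, `CountVectorizer` could replace `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def make_nb_pipeline():
    """Same TF-IDF step as train.py, with Naive Bayes as the classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ("clf", MultinomialNB()),
    ])


# Made-up toy messages so the sketch runs without the CSV files
texts = ["free prize click now", "win free money now",
         "are we meeting today", "see you at lunch today"]
labels = ["spam", "spam", "ham", "ham"]

nb_pipe = make_nb_pipeline().fit(texts, labels)
print(nb_pipe.predict(["free money prize"]))
```

Because both models sit behind the same Pipeline interface, the evaluation loop and `joblib.dump` call work without changes; compare the two on the validation set before touching the test set.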

Step 3 — Read your metrics

Figure: Confusion matrix reminder

FP = ham marked spam. FN = spam missed.

Metric	Spam-filter question
Precision	When we flag spam, how often are we right?
Recall	Of all real spam, how much did we catch?

Low recall → spam reaches inbox. Low precision → real mail in spam folder.
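The two metrics follow directly from the confusion matrix counts. A small worked example, using hypothetical counts rather than results from this project:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the positive class (here: spam)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, not results from this project:
# 90 spam caught (TP), 10 ham wrongly flagged (FP), 30 spam missed (FN)
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With these counts, precision is 90 / 100 = 0.90 and recall is 90 / 120 = 0.75, so this filter rarely misfiles real mail but lets a quarter of the spam through.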

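Threshold tuning (optional): pipe.predict_proba(texts) takes a list of messages and returns one row of class probabilities per message, with columns ordered as in pipe.classes_. By default, predict labels a message spam when its spam probability exceeds 0.5. Lowering that threshold raises recall at the cost of precision. Choose the threshold on the validation set, never on the test set.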
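A sketch of a threshold sweep in plain Python. The labels and probabilities are made up; in the project they would come from the validation split, as noted in the comment. `labels_at_threshold` and `precision_recall` are hypothetical helper names.

```python
def labels_at_threshold(spam_probs, threshold):
    """Flag a message as spam when its spam probability reaches the threshold."""
    return ["spam" if p >= threshold else "ham" for p in spam_probs]


def precision_recall(y_true, y_pred):
    """Precision and recall for the spam class."""
    tp = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))
    fp = sum(t == "ham" and p == "spam" for t, p in zip(y_true, y_pred))
    fn = sum(t == "spam" and p == "ham" for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up validation labels and probabilities. In the project these would be
# val["label"] and pipe.predict_proba(val["text"])[:, list(pipe.classes_).index("spam")]
y_val = ["spam", "spam", "spam", "ham", "ham", "ham"]
spam_probs = [0.90, 0.55, 0.35, 0.40, 0.20, 0.05]

for threshold in (0.5, 0.3):
    p, r = precision_recall(y_val, labels_at_threshold(spam_probs, threshold))
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

In this toy data, dropping the threshold from 0.5 to 0.3 catches the third spam message but also flags one ham message, which is the trade-off to record in the README.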

Step 4 — Python inference service (Flask)

Node will not run sklearn directly — call a small Python server:

python

# python/serve.py
from flask import Flask, request, jsonify
import joblib
from pathlib import Path
 
app = Flask(__name__)
pipe = joblib.load(Path(__file__).resolve().parent.parent / "server" / "model" / "spam_model.joblib")
 
@app.post("/predict")
def predict():
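    # silent=True returns None instead of raising on a missing or malformed JSON body
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "") if isinstance(payload, dict) else ""
    # Reject non-string or blank input before it reaches the model
    if not isinstance(text, str) or not text.strip():
        return jsonify({"error": "text required"}), 400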
    proba = pipe.predict_proba([text])[0]
    classes = list(pipe.classes_)
    spam_idx = classes.index("spam")
    return jsonify({
        "label": pipe.predict([text])[0],
        "spam_probability": float(proba[spam_idx]),
    })
 
if __name__ == "__main__":
    app.run(port=5001, debug=False)

Run: python python/serve.py — keep this terminal open while testing Node.
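Before wiring up Node, the Flask service can be checked on its own. A minimal standard-library sketch, assuming the service is running on port 5001 as configured above; the script name and both helper functions are hypothetical.

```python
# python/smoke_test.py (hypothetical helper)
import json
import urllib.request

PREDICT_URL = "http://127.0.0.1:5001/predict"  # matches app.run(port=5001) in serve.py


def build_request(text, url=PREDICT_URL):
    """Build the JSON POST request that serve.py expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(text, url=PREDICT_URL, timeout=5):
    """Send one message to the Flask service and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        print(predict("Free money click now!!!"))
    except OSError as err:  # covers connection refused, timeouts, and HTTP errors
        print("Flask service not reachable or returned an error:", err)
```

If this prints a label and a spam probability, any later failure in Step 5 is on the Node or MongoDB side.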

Step 5 — Node.js API + MongoDB

bash

cd server && npm init -y && npm install express mongodb cors dotenv axios

Add "type": "module" to package.json.

javascript

// server/index.js
import express from "express";
import cors from "cors";
import axios from "axios";
import { MongoClient } from "mongodb";
import "dotenv/config";
 
const app = express();
app.use(cors());
app.use(express.json());
 
const mongo = new MongoClient(process.env.MONGODB_URI || "mongodb://127.0.0.1:27017");
const PY_URL = process.env.PY_PREDICT_URL || "http://127.0.0.1:5001/predict";
 
app.post("/classify", async (req, res) => {
  try {
    const text = String(req.body.text || "").trim();
    if (!text) return res.status(400).json({ error: "text required" });
 
    const { data } = await axios.post(PY_URL, { text });
    const doc = {
      text: text.slice(0, 500),  // truncate for privacy in demos
      label: data.label,
      spamProbability: data.spam_probability,
      createdAt: new Date(),
    };
 
    await mongo.db("spam_lab").collection("predictions").insertOne(doc);
    res.json(doc);
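  } catch (err) {
    console.error(err);
    // Axios errors mean Flask failed or was unreachable; anything else (such as MongoDB) is our own failure
    if (axios.isAxiosError(err)) {
      return res.status(502).json({ error: "prediction service unavailable" });
    }
    res.status(500).json({ error: "internal server error" });
  }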
});
 
app.get("/health", (_, res) => res.json({ ok: true }));
 
const port = process.env.PORT || 4000;
await mongo.connect();
app.listen(port, () => console.log(`API on :${port}`));

Request flow:

The client sends POST /classify with { "text": "..." }.
Node forwards the text to Flask /predict.
Flask runs the text through the Pipeline it loaded at startup (TF-IDF, then logistic regression) and returns the label and spam probability.
Node stores the result in MongoDB and returns the JSON to the client.

Test:

bash

curl -X POST http://localhost:4000/classify \
  -H "Content-Type: application/json" \
  -d "{\"text\":\"Free money click now!!!\"}"
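To confirm that documents are landing in MongoDB after the curl call, the collection can be read back from Python. This sketch assumes `pymongo` is installed (it is not in the install list above) and uses the database and collection names from index.js; the script name, `summarize`, and `fetch_recent` are hypothetical.

```python
# python/inspect_logs.py (hypothetical helper; needs `pip install pymongo`)
from collections import Counter


def summarize(docs):
    """Count labels and average spam probability across logged predictions."""
    docs = list(docs)
    counts = Counter(d["label"] for d in docs)
    avg = sum(d["spamProbability"] for d in docs) / len(docs) if docs else 0.0
    return {"total": len(docs), "by_label": dict(counts), "avg_spam_probability": avg}


def fetch_recent(limit=20, uri="mongodb://127.0.0.1:27017"):
    """Read the newest documents written by POST /classify."""
    from pymongo import MongoClient  # imported here so summarize() works without pymongo

    client = MongoClient(uri, serverSelectionTimeoutMS=2000)
    try:
        cursor = (
            client["spam_lab"]["predictions"]
            .find({}, {"_id": 0})
            .sort("createdAt", -1)  # newest first
            .limit(limit)
        )
        return list(cursor)
    finally:
        client.close()


# Documents shaped like the ones index.js inserts (values are made up)
sample = [
    {"text": "Free money click now!!!", "label": "spam", "spamProbability": 0.9},
    {"text": "Lunch at noon?", "label": "ham", "spamProbability": 0.1},
]
print(summarize(sample))
```

With MongoDB running, `summarize(fetch_recent())` gives a quick view of what the API has logged so far.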

Step 6 — Meaningful experiments

Document in README:

Test precision / recall / F1 on test set (not validation).
Train on only 50 messages. What happens to test recall? (too little data to learn the spam vocabulary)
Is validation F1 much higher than test F1? (overfitting to the validation set during tuning, or a lucky split)
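The small-data experiment can be scripted so the numbers in the README are reproducible. A sketch, assuming the Step 2 pipeline settings; `f1_by_train_size` is a hypothetical helper, and the subsample is stratified so that every subset contains both classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def f1_by_train_size(train, test, sizes, seed=42):
    """Fit the Step 2 pipeline on stratified subsets of `train`; report spam F1."""
    results = {}
    for n in sizes:
        if n >= len(train):
            subset = train
        else:
            # Stratified subsample so every subset contains both classes
            subset, _ = train_test_split(
                train, train_size=n, stratify=train["label"], random_state=seed
            )
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
            ("clf", LogisticRegression(max_iter=1000)),
        ])
        pipe.fit(subset["text"], subset["label"])
        results[n] = {
            split_name: f1_score(part["label"], pipe.predict(part["text"]),
                                 pos_label="spam", zero_division=0)
            for split_name, part in [("train", subset), ("test", test)]
        }
    return results


# Usage with the real splits (run from the project root):
# import pandas as pd
# train = pd.read_csv("data/train.csv"); test = pd.read_csv("data/test.csv")
# print(f1_by_train_size(train, test, sizes=[50, 500, len(train)]))
```

Reporting train F1 next to test F1 for each size makes the gap between the two visible, which is the comparison the overfitting question asks for.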

Troubleshooting

Symptom	Fix
`ECONNREFUSED` on classify	Start Flask on port 5001 first
Validation has 0 spam	Re-run `prepare_data.py` with `stratify=`
Perfect train, bad test	Overfitting — reduce `max_features` or add data
Mongo connection failed	Start `mongod` or fix Atlas URI in `.env`

Deliverables

train.csv / val.csv / test.csv
spam_model.joblib + confusion_matrix.png
Test set precision & recall written in README
POST /classify + MongoDB documents visible
Short note: which metric you optimized for and why

What's next

Module 2 complete. Continue to Module 3 — Neural networks when ready.
