Project: sentiment analysis + dashboard
Before we begin
Build a product review sentiment system:
- Input: review text (CSV of Amazon/Flipkart-style reviews)
- Output: positive / negative
- Plus: a dashboard charting sentiment trends over time or by product category
Figure: End-to-end pipeline
How this connects to Module 4
| Lesson | Where you use it |
|---|---|
| CNNs | Optional stretch — 1D conv on token sequences |
| RNNs / LSTM | nn.LSTM reads word sequence left-to-right |
| Embeddings | nn.Embedding maps token IDs → dense vectors |
| Sequence padding | Batch variable-length reviews with pad_sequence |
| Transfer learning | Optional GloVe vectors in embedding layer |
Folder layout:
sentiment-project/
  data/
    reviews.csv
    train.csv / val.csv / test.csv
    vocab.json
  python/
    prepare_reviews.py
    vocab.py
    train_baseline.py
    train_lstm.py
    serve.py
  reports/
    baseline_metrics.txt
    lstm_confusion.png
  model/
    lstm.pt
    vocab.json
What you will build
- Clean and split a public review dataset (train / val / test).
- Baseline: TF-IDF + logistic regression.
- LSTM with embedding layer (scratch or pre-trained vectors).
- Compare precision/recall/F1 on test set.
- Next.js API for single-review prediction.
- Dashboard page — sentiment ratio over time, sample table of recent predictions.
Estimated time: 4–6 hours.
Before you start
- Finish the Module 4 quiz.
pip install torch pandas matplotlib scikit-learn flask
Use a dataset such as Amazon Product Reviews (Kaggle), or any CSV with text and label (0/1 or pos/neg) columns plus optional date and category columns. Keep a held-out test set untouched until the end.
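Before splitting, it can help to confirm that the CSV really has the expected columns and label values. The check below is a minimal sketch using only the standard library; the function name check_reviews_csv and the inline sample rows are illustrative assumptions, not part of the project files.

```python
import csv
import io

REQUIRED = {"text", "label"}
ALLOWED_LABELS = {"0", "1", "pos", "neg", "positive", "negative"}

def check_reviews_csv(handle):
    """Return a list of problems found in a reviews CSV (empty list = looks fine)."""
    reader = csv.DictReader(handle)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["text"] or "").strip():
            problems.append(f"line {line_no}: empty text")
        if (row["label"] or "").strip().lower() not in ALLOWED_LABELS:
            problems.append(f"line {line_no}: unexpected label {row['label']!r}")
    return problems

# Toy rows, made up for illustration only
sample = io.StringIO(
    "text,label,date\n"
    "Great phone,pos,2024-03-18\n"
    ",neg,2024-03-19\n"
    "Battery died,maybe,2024-03-20\n"
)
print(check_reviews_csv(sample))
# ["line 3: empty text", "line 4: unexpected label 'maybe'"]
```

Running a check like this on reviews.csv first makes label or column problems visible before they surface later as a failed stratified split.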
Step 1 — Prepare data
# python/prepare_reviews.py
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path

root = Path(__file__).resolve().parent.parent
df = pd.read_csv(root / "data" / "reviews.csv")
df = df.dropna(subset=["text", "label"])
df["text"] = df["text"].str.lower().str.strip()
# Empty strings are read back from CSV as NaN, so drop them now
df = df[df["text"] != ""]

# Map pos/neg strings to 0/1 if needed
if df["label"].dtype == object:
    df["label"] = df["label"].map({"neg": 0, "negative": 0, "pos": 1, "positive": 1})
    # Labels outside the mapping become NaN; drop them so stratify does not fail
    df = df.dropna(subset=["label"])
df["label"] = df["label"].astype(int)

# 70% train, then the remaining 30% split evenly into val and test
train, temp = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
val, test = train_test_split(temp, test_size=0.5, stratify=temp["label"], random_state=42)

data = root / "data"
train.to_csv(data / "train.csv", index=False)
val.to_csv(data / "val.csv", index=False)
test.to_csv(data / "test.csv", index=False)
print(len(train), len(val), len(test))

Token vocabulary (train only):
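# python/vocab.py
import json
import re
from collections import Counter
from pathlib import Path

PAD, UNK = "<pad>", "<unk>"

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(texts, max_vocab=15000):
    counts = Counter()
    for t in texts:
        counts.update(tokenize(t))
    # Two slots are reserved for PAD (0) and UNK (1)
    words = [w for w, _ in counts.most_common(max_vocab - 2)]
    stoi = {PAD: 0, UNK: 1, **{w: i + 2 for i, w in enumerate(words)}}
    return stoi

def encode(text, stoi, max_len=200):
    ids = [stoi.get(w, stoi[UNK]) for w in tokenize(text)][:max_len]
    # A review with no tokens would give length 0, which pack_padded_sequence rejects
    return ids or [stoi[UNK]]

if __name__ == "__main__":
    # Build from train.csv only, then save for train_lstm.py
    import pandas as pd
    root = Path(__file__).resolve().parent.parent
    stoi = build_vocab(pd.read_csv(root / "data" / "train.csv")["text"])
    (root / "data" / "vocab.json").write_text(json.dumps(stoi))

Never build vocabulary from val/test: that leaks word statistics from held-out data into training.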
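To see what a train-only vocabulary means at inference time, the sketch below builds a tiny vocabulary from two toy training sentences and encodes a review containing unseen words. It repeats simplified copies of tokenize, build_vocab, and encode so that it runs on its own; the sentences are made up for illustration.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(texts, max_vocab=15000):
    counts = Counter()
    for t in texts:
        counts.update(tokenize(t))
    words = [w for w, _ in counts.most_common(max_vocab - 2)]
    return {PAD: 0, UNK: 1, **{w: i + 2 for i, w in enumerate(words)}}

def encode(text, stoi, max_len=200):
    return [stoi.get(w, stoi[UNK]) for w in tokenize(text)][:max_len]

# Toy training texts; "good" is the most frequent word, so it gets id 2
train_texts = ["good phone good price", "bad battery"]
stoi = build_vocab(train_texts)

# "terrible" and "screen" never appeared in training, so both map to UNK (id 1)
ids = encode("good battery, terrible screen", stoi)
print(stoi)
print(ids)
```

Words that appear only in val or test reviews end up as UNK, which is the same situation the model faces with real users, so validation scores stay honest.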
Step 2 — Baseline (sklearn)
# python/train_baseline.py
import pandas as pd
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

root = Path(__file__).resolve().parent.parent
train = pd.read_csv(root / "data" / "train.csv")
val = pd.read_csv(root / "data" / "val.csv")

# Fit TF-IDF on train only; val is transformed with the fitted vocabulary
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train["text"])
X_val = vectorizer.transform(val["text"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])

report = classification_report(val["label"], clf.predict(X_val))
print(report)
(root / "reports").mkdir(exist_ok=True)
(root / "reports" / "baseline_metrics.txt").write_text(report)

Save this F1: with enough data, your LSTM should match or beat it. If the baseline wins, check data size and LSTM hyperparameters before over-tuning the neural net.
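As a worked example of the arithmetic behind the report, the sketch below computes precision, recall, and F1 for the positive class from raw counts. The helper name precision_recall_f1 and the labels are illustrative; for real numbers, scikit-learn's classification_report remains the tool to use.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```

Because F1 is the harmonic mean of precision and recall, a model that predicts "positive" for everything can still reach perfect recall but is penalized through precision.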
Step 3 — LSTM model + DataLoader
# python/train_lstm.py
import json
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import pandas as pd
from pathlib import Path
from vocab import encode, PAD

class ReviewDataset(Dataset):
    def __init__(self, df, stoi, max_len=200):
        self.df = df.reset_index(drop=True)
        self.stoi = stoi
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        ids = torch.tensor(encode(row["text"], self.stoi, self.max_len), dtype=torch.long)
        label = torch.tensor(float(row["label"]), dtype=torch.float32)
        return ids, label

def collate(batch):
    # Pad each batch to its own longest review and keep the true lengths
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(seqs, batch_first=True, padding_value=0)
    return padded, lengths, torch.stack(labels)

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x, lengths):
        emb = self.embedding(x)
        # lengths must live on the CPU for pack_padded_sequence
        packed = nn.utils.rnn.pack_padded_sequence(
            emb, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # One logit per review; sigmoid is applied by the loss and at predict time
        return self.fc(h_n[-1]).squeeze(-1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
root = Path(__file__).resolve().parent.parent
stoi = json.loads((root / "data" / "vocab.json").read_text())

train_ds = ReviewDataset(pd.read_csv(root / "data" / "train.csv"), stoi)
val_ds = ReviewDataset(pd.read_csv(root / "data" / "val.csv"), stoi)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, collate_fn=collate)
val_loader = DataLoader(val_ds, batch_size=128, collate_fn=collate)

model = SentimentLSTM(len(stoi)).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

What each piece does:
| Component | Role |
|---|---|
| pad_sequence | Pads shorter reviews with 0 (PAD) to same length in batch |
| pack_padded_sequence | LSTM skips PAD tokens; faster and correct |
| BCEWithLogitsLoss | Binary classification; model outputs one logit |
| h_n[-1] | Last layer hidden state = sentence summary vector |
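The first two rows of the table can be illustrated without PyTorch. The sketch below is a plain-Python picture of what pad_sequence with batch_first=True produces and of the lengths that the collate function records; pad_batch is a hypothetical helper for illustration, not PyTorch's implementation.

```python
def pad_batch(seqs, pad_id=0):
    """Right-pad every sequence to the longest one and keep the true lengths."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    return padded, lengths

# Three toy reviews, already encoded as token IDs
batch = [[5, 9, 2], [7], [4, 4]]
padded, lengths = pad_batch(batch)
print(padded)   # [[5, 9, 2], [7, 0, 0], [4, 4, 0]]
print(lengths)  # [3, 1, 2]
```

pack_padded_sequence uses those lengths so the LSTM stops at each review's last real token. Without packing, h_n[-1] for the second review would be the state after two extra PAD steps rather than the state after token 7.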
Step 4 — Training with early stopping
from sklearn.metrics import f1_score
import numpy as np

(root / "model").mkdir(exist_ok=True)
best_f1, patience, bad_epochs = 0.0, 3, 0
history = {"train_loss": [], "val_f1": []}

for epoch in range(20):
    model.train()
    losses = []
    for x, lengths, y in train_loader:
        x, lengths, y = x.to(device), lengths.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x, lengths)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    model.eval()
    preds, labels = [], []
    with torch.no_grad():
        for x, lengths, y in val_loader:
            x, lengths = x.to(device), lengths.to(device)
            prob = torch.sigmoid(model(x, lengths)).cpu().numpy()
            preds.extend((prob >= 0.5).astype(int))
            labels.extend(y.numpy().astype(int))

    val_f1 = f1_score(labels, preds)
    history["train_loss"].append(np.mean(losses))
    history["val_f1"].append(val_f1)
    print(f"epoch {epoch+1} loss={np.mean(losses):.4f} val_f1={val_f1:.4f}")

    # Keep only the best checkpoint; stop after `patience` epochs without improvement
    if val_f1 > best_f1:
        best_f1 = val_f1
        bad_epochs = 0
        torch.save(model.state_dict(), root / "model" / "lstm.pt")
    else:
        bad_epochs += 1
    if bad_epochs >= patience:
        print("early stop")
        break

Plot train_loss vs val_f1. Falling train loss with flat or falling val F1 → overfitting.
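The early-stopping rule in the loop can be checked in isolation on a list of scores. The sketch below mirrors the same logic (strict improvement resets the counter, patience of 3); early_stop_epoch is a hypothetical helper and the F1 values are made up.

```python
def early_stop_epoch(val_scores, patience=3):
    """Return (best_epoch, last_epoch_run), both 1-indexed."""
    best, best_epoch, bad = 0.0, 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, best_epoch, bad = score, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores)

# Made-up validation F1 per epoch
scores = [0.71, 0.78, 0.80, 0.79, 0.80, 0.77, 0.81]
print(early_stop_epoch(scores))  # (3, 6)
```

In this toy run the checkpoint from epoch 3 is kept and training stops at epoch 6, so the 0.81 at epoch 7 is never seen. A tie (0.80 at epoch 5) counts as a bad epoch because the comparison is strict; a larger patience trades training time for a chance to catch late improvements.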
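Optional (GloVe): load 50d or 100d vectors into embedding.weight for words that are in your vocabulary, and set embed_dim to match the vector size. nn.Embedding weights are trainable by default, so the vectors are fine-tuned unless you freeze them with embedding.weight.requires_grad = False.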
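One way to prepare the GloVe matrix is sketched below, assuming the usual GloVe text format of one word per line followed by space-separated floats. The helper load_glove, the random range for missing words, and the 3-dimensional toy vectors are assumptions for illustration; real files have 50 or more dimensions.

```python
import io
import random

def load_glove(handle, stoi, dim, seed=0):
    """Build a len(stoi) x dim matrix: GloVe rows for known words, small random rows otherwise."""
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in stoi]
    matrix[0] = [0.0] * dim  # row 0 is PAD; keep it at zero
    found = 0
    for line in handle:
        word, *values = line.rstrip().split(" ")
        # Skip words outside the vocabulary and lines with the wrong dimension
        if word in stoi and len(values) == dim:
            matrix[stoi[word]] = [float(v) for v in values]
            found += 1
    return matrix, found

# Toy "GloVe" lines, made up for illustration
fake_glove = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\nzebra 0.9 0.9 0.9\n")
stoi = {"<pad>": 0, "<unk>": 1, "good": 2, "bad": 3, "battery": 4}
matrix, found = load_glove(fake_glove, stoi, dim=3)
print(found)      # 2
print(matrix[2])  # [0.1, 0.2, 0.3]
```

With PyTorch, the result could then be copied into the model with model.embedding.weight.data.copy_(torch.tensor(matrix)), provided embed_dim equals dim. Checking found against len(stoi) shows how much of the vocabulary the pre-trained vectors actually cover.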
Step 5 — Test set evaluation (once)
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

test_ds = ReviewDataset(pd.read_csv(root / "data" / "test.csv"), stoi)
test_loader = DataLoader(test_ds, batch_size=128, collate_fn=collate)
model.load_state_dict(torch.load(root / "model" / "lstm.pt", weights_only=True))
model.eval()
# ... same predict loop → classification_report on TEST only

Write one paragraph: where does the model fail? (sarcasm, negation "not good", very short reviews?)
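For the error analysis, it helps to look first at the mistakes the model was most confident about. The sketch below is one way to pick them, assuming you have collected the test texts, true labels, and sigmoid scores into lists; worst_errors is a hypothetical helper and the data is made up.

```python
def worst_errors(texts, labels, scores, k=5, threshold=0.5):
    """Return up to k misclassified reviews, most confident mistakes first."""
    errors = []
    for text, label, score in zip(texts, labels, scores):
        pred = int(score >= threshold)
        if pred != label:
            errors.append((abs(score - threshold), text, label, score))
    errors.sort(key=lambda e: e[0], reverse=True)
    return [(text, label, score) for _, text, label, score in errors[:k]]

# Toy predictions, made up for illustration
texts = ["not good at all", "great, just great...", "ok", "love it"]
labels = [0, 0, 1, 1]
scores = [0.91, 0.62, 0.45, 0.97]
for row in worst_errors(texts, labels, scores):
    print(row)
# ('not good at all', 0, 0.91)
# ('great, just great...', 0, 0.62)
# ('ok', 1, 0.45)
```

In this toy output the three failure types from the prompt above each show up once: negation, sarcasm, and a very short review.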
Step 6 — Flask inference + Next.js API
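# python/serve.py
import json
from pathlib import Path
import torch
from flask import Flask, request, jsonify
from vocab import encode
from train_lstm import SentimentLSTM  # assumes importing it does not start a training run

root = Path(__file__).resolve().parent.parent
stoi = json.loads((root / "data" / "vocab.json").read_text())
model = SentimentLSTM(len(stoi)).eval()
model.load_state_dict(torch.load(root / "model" / "lstm.pt", map_location="cpu", weights_only=True))
app = Flask(__name__)

@app.post("/predict")
def predict():
    text = request.json.get("text", "")
    ids = torch.tensor([encode(text, stoi)], dtype=torch.long)
    lengths = torch.tensor([ids.size(1)])
    with torch.no_grad():
        logit = model(ids, lengths)
    score = torch.sigmoid(logit).item()
    label = "positive" if score >= 0.5 else "negative"
    return jsonify({"label": label, "score": score})

// app/api/sentiment/route.ts
// POST { text: string } → { label: "positive" | "negative", score: number }

Step 7 — Sentiment dashboard (Next.js)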
Create app/sentiment-lab/page.tsx:
- Line chart: % positive reviews per week (group the CSV date column with d3 or Recharts).
- Bar chart: sentiment by category, if available.
- Table: last 20 API predictions with text snippet + score.
Make it meaningful: compare two categories or two months — not just one static number.
Example aggregation (server or client):
// bucket reviews by ISO week → { week: "2024-W12", positiveRate: 0.72 }

Troubleshooting
| Symptom | Fix |
|---|---|
| LSTM worse than TF-IDF | More data, lower max_len, try bidirectional LSTM |
| Val F1 = 0 always | Labels not 0/1; check BCEWithLogitsLoss + sigmoid at predict |
| OOM on GPU | Reduce batch_size or hidden size |
| Dashboard empty | Ensure CSV has parseable date column |
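The weekly bucketing behind the line chart, and the parseable-date requirement in the last row, can be sketched in Python if you choose to precompute the series on the server. The function name weekly_positive_rate and the rows are illustrative, and the sketch assumes dates in YYYY-MM-DD form; other formats would need their own parsing.

```python
from collections import defaultdict
from datetime import date

def weekly_positive_rate(rows):
    """rows: iterable of (iso_date_string, label) with label 1 = positive."""
    buckets = defaultdict(lambda: [0, 0])  # week -> [positives, total]
    for day, label in rows:
        try:
            iso = date.fromisoformat(day).isocalendar()
        except ValueError:
            continue  # skip unparseable dates instead of failing the whole chart
        week = f"{iso[0]}-W{iso[1]:02d}"
        buckets[week][0] += int(label)
        buckets[week][1] += 1
    return [
        {"week": week, "positiveRate": round(pos / total, 2)}
        for week, (pos, total) in sorted(buckets.items())
    ]

# Toy rows, made up for illustration
rows = [
    ("2024-03-18", 1), ("2024-03-19", 0), ("2024-03-21", 1),
    ("2024-03-25", 1), ("not a date", 1),
]
print(weekly_positive_rate(rows))
# [{'week': '2024-W12', 'positiveRate': 0.67}, {'week': '2024-W13', 'positiveRate': 1.0}]
```

If every row is skipped, the result is an empty list, which is exactly the "Dashboard empty" symptom; counting skipped rows is a quick way to confirm a date-format problem.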
Deliverables
- train.csv / val.csv / test.csv with no leakage
- Baseline + LSTM test metrics in README
- Working POST /api/sentiment
- Dashboard screenshot or live page link
- Error analysis: 5 misclassified reviews explained
What's next
Module 4 complete.
Want more vision depth? Continue to Module 5 — Image segmentation — U-Net from scratch plus FCN, DeepLab, and Mask R-CNN (recommended before transformers).
Or jump to Module 6 — Transformers (core of GenAI) when ready.