Project: mini transformer — next-word prediction on blog text
Before we begin
You will train a small decoder-only transformer to predict the next token — the same core training objective as GPT, at a scale that runs on a laptop.
Corpus: plain text from this site’s blog MDX files in content/blog/. Your own writing becomes the training data, so generations can echo your blog’s topics (AI, mobile, computer vision).
Figure: Blog → train → generate
How this connects to Module 6
| Lesson | What you use in this project |
|---|---|
| Attention / self-attention | Each layer mixes token vectors; causal mask blocks future tokens |
| Transformer architecture | Embedding + position + stacked blocks + linear head |
| Encoder vs decoder | You build decoder-only (GPT-style), not BERT |
| Tokenization | Word-level vocab here (BPE optional stretch goal) |
Training objective: given tokens [t₀, t₁, …, tₙ₋₁], predict [t₁, t₂, …, tₙ] — shift-by-one next-token prediction.
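A minimal sketch of the shift-by-one pairing on a toy sequence. The token strings are made up for illustration; in the project they would be integer IDs from the tokenizer.

```python
# Toy token sequence (hypothetical); real training uses integer IDs.
tokens = ["on", "device", "ai", "runs", "locally"]

inputs = tokens[:-1]   # t0 .. t(n-1)
targets = tokens[1:]   # t1 .. tn: the same sequence shifted left by one

# Each input position is trained to predict the token that follows it.
pairs = list(zip(inputs, targets))
for seen, nxt in pairs:
    print(f"{seen!r} -> {nxt!r}")
```

Every position in the sequence yields one training signal, which is why even a short corpus produces many examples.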
What you will build
- Extract and clean text from blog MDX files.
- Build a word-level tokenizer + vocabulary.
- Train a tiny transformer (2–4 layers, small width).
- Plot training loss vs epochs.
- Generate continuations from prompts like "on device ai" or "computer vision".
- Optional: CLI or simple UI to type a prompt and print output.
Estimated time: 4–6 hours.
Before you start
- Finish the Module 6 quiz.
- Python 3.10+ and pip install torch matplotlib
Create a project folder inside your repo (or next to it, if you adjust the paths below):
mini-transformer/
  data/            # corpus + vocab (created by scripts)
  build_corpus.py
  train.py
  generate.py
  checkpoints/     # saved model weights
From mini-transformer/, the blog MDX files are typically at ../content/blog if your repo root is the parent folder.
Step 1 — Build a text corpus
Goal: Turn MDX blog posts into one plain-text file the model can learn from.
# build_corpus.py
from pathlib import Path
import re

# Path from mini-transformer/ → repo content/blog
root = Path(__file__).resolve().parent.parent / "content" / "blog"
out = Path(__file__).resolve().parent / "data"
out.mkdir(parents=True, exist_ok=True)

texts = []
for path in sorted(root.glob("*.mdx")):
    raw = path.read_text(encoding="utf-8")
    # Remove YAML frontmatter between the first --- pair
    body = re.sub(r"^---.*?---\s*", "", raw, flags=re.S)
    # Drop images ![alt](url) entirely; for links [text](url), keep the text and drop the URL
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", " ", body)
    body = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", body)
    # Drop # heading markers but keep the words; collapse extra whitespace
    body = re.sub(r"#+\s*", "", body)
    body = re.sub(r"\s+", " ", body).strip()
    if body:
        texts.append(body)

corpus = "\n".join(texts).lower()
(out / "blog_corpus.txt").write_text(corpus, encoding="utf-8")
print(f"Wrote {len(corpus):,} characters from {len(texts)} posts")
What each part does:
- Frontmatter regex: MDX files start with a --- metadata block; the model should not memorize dates and titles from YAML.
- Link/image regex: removes URLs; keeps the surrounding article text.
- .lower(): smaller vocab (AI and ai merge); fine for a learning project.
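To see the clean-up rules act on a concrete input, here is a small sketch that applies them to one in-memory string. The sample post is made up, clean_mdx is a hypothetical helper name, and this variant keeps link text while dropping the URL, which is an assumption about the intended behaviour.

```python
import re

def clean_mdx(raw: str) -> str:
    """Apply frontmatter, image, link, and heading clean-up to one MDX string."""
    body = re.sub(r"^---.*?---\s*", "", raw, flags=re.S)     # YAML frontmatter
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", " ", body)         # images: drop entirely
    body = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", body)      # links: keep text, drop URL
    body = re.sub(r"#+\s*", "", body)                         # heading markers
    return re.sub(r"\s+", " ", body).strip().lower()

# Hypothetical post, invented for this example.
sample = """---
title: "Hello"
date: 2024-01-01
---
# On-Device AI

Read the [intro post](https://example.com/intro) first.
![diagram](/img/d.png)
"""

cleaned = clean_mdx(sample)
print(cleaned)
```

Running a check like this on one or two real posts before building the full corpus makes it easy to spot leftover markup such as JSX components or code fences.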
Run: python build_corpus.py
Expect: thousands to tens of thousands of characters, depending on how many blog posts exist. If the script reports 0 characters from 0 posts, fix the path to content/blog.
Train / validation split (do this before training, and never train on the validation part):
# inside train.py later, or a small split helper (all_ids comes from Step 2)
split_at = int(len(all_ids) * 0.9)
train_ids = all_ids[:split_at]
val_ids = all_ids[split_at:]
Hold out 10% of token IDs so you can see whether loss improves on unseen text, not just on text the model has memorized.
Step 2 — Tokenizer (word-level)
Goal: Map text ↔ lists of integers the neural net can process.
# tokenizer.py (or top of train.py)
import re
import json
from collections import Counter
from pathlib import Path

text = Path("data/blog_corpus.txt").read_text(encoding="utf-8")
# Split into word tokens and single punctuation characters
tokens = re.findall(r"[a-z0-9]+|[^\s]", text)
counts = Counter(tokens)

vocab = {"<pad>": 0, "<unk>": 1}  # reserve IDs for padding and unknown words
for tok, _ in counts.most_common(8000):
    vocab[tok] = len(vocab)
id_to_tok = {i: t for t, i in vocab.items()}

def encode(s: str) -> list[int]:
    pieces = re.findall(r"[a-z0-9]+|[^\s]", s.lower())
    return [vocab.get(t, vocab["<unk>"]) for t in pieces]

def decode(ids: list[int]) -> str:
    return " ".join(id_to_tok[i] for i in ids)

Path("data/vocab.json").write_text(json.dumps(vocab), encoding="utf-8")
print("vocab size:", len(vocab))
all_ids = encode(text)
What each part does:
| Piece | Meaning |
|---|---|
| re.findall(...) | "on-device" → ["on", "-", "device"]; simple, but works for learning |
| <unk> | Words not in the top 8k map to ID 1, so rare words won’t crash training |
| encode / decode | Same interface real tokenizers use (string ↔ ID list) |
| all_ids | Entire corpus as one long integer sequence for the dataset |
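A quick round trip shows both the `<unk>` fallback and the fact that decode does not restore the original spacing. The sketch below uses a tiny hand-written vocabulary (hypothetical, only for this example) with the same encode and decode logic as above.

```python
import re

# Tiny hand-written vocabulary, invented for this example.
vocab = {"<pad>": 0, "<unk>": 1, "on": 2, "-": 3, "device": 4, "ai": 5, ".": 6}
id_to_tok = {i: t for t, i in vocab.items()}

def encode(s: str) -> list[int]:
    pieces = re.findall(r"[a-z0-9]+|[^\s]", s.lower())
    return [vocab.get(t, vocab["<unk>"]) for t in pieces]

def decode(ids: list[int]) -> str:
    return " ".join(id_to_tok[i] for i in ids)

ids = encode("On-device AI rocks.")
print(ids)          # "rocks" is not in the vocab, so it maps to <unk> (ID 1)
print(decode(ids))  # tokens are joined with spaces, so punctuation spacing is lost
```

The lossy decode is harmless for this project, but it explains why generated text shows spaces before periods and around hyphens.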
Stretch: swap in tiktoken (Lesson 5) for subword tokens — better for rare words and punctuation.
Step 3 — Next-token dataset
Goal: PyTorch Dataset that returns (input, target) pairs where target is input shifted by one position.
import torch
from torch.utils.data import Dataset

class NextTokenDataset(Dataset):
    def __init__(self, ids: list[int], block_size: int = 128):
        self.ids = ids
        self.block_size = block_size

    def __len__(self):
        # One sample per starting position where a full window of block_size + 1 tokens fits
        return max(0, len(self.ids) - self.block_size)

    def __getitem__(self, i):
        chunk = self.ids[i : i + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)  # input: positions 0..T-1
        y = torch.tensor(chunk[1:], dtype=torch.long)  # target: positions 1..T
        return x, y
Example with block_size=4 and chunk [10, 20, 30, 40, 50]:
| Position | Input x | Target y (next token) |
|---|---|---|
| 0 | 10 | 20 |
| 1 | 20 | 30 |
| 2 | 30 | 40 |
| 3 | 40 | 50 |
The model sees up to block_size tokens — your context window (Lesson 5). Longer posts are learned in overlapping windows, not all at once.
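To make "overlapping windows" concrete, this pure-Python sketch lists every (input, target) window for a short made-up ID sequence. The windows helper is hypothetical and only mirrors the slicing idea; it is not part of the project files.

```python
def windows(ids: list[int], block_size: int):
    """Yield every (input, target) pair of length block_size from a token list."""
    for start in range(len(ids) - block_size):
        chunk = ids[start : start + block_size + 1]
        yield chunk[:-1], chunk[1:]

ids = [10, 20, 30, 40, 50, 60, 70]   # made-up token IDs
pairs = list(windows(ids, block_size=4))
for x, y in pairs:
    print(x, "->", y)
```

Neighbouring windows differ by a single token, so a corpus of N tokens yields close to N training samples, most of them highly redundant.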
DataLoader batches many windows:
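from torch.utils.data import DataLoader

train_ds = NextTokenDataset(train_ids, block_size=128)
val_ds = NextTokenDataset(val_ids, block_size=128)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)  # used by evaluate() in Step 5
shuffle=True randomizes the order of windows each epoch, so a batch is not filled with near-identical overlapping windows. The validation loader does not need shuffling.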
Step 4 — Tiny GPT model
Goal: Decoder-only stack — token + position embeddings, causal transformer layers, linear head to vocab.
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model: int = 128,
        nhead: int = 4,
        num_layers: int = 3,
        block_size: int = 128,
    ):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=512,
            batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        if T > self.block_size:
            raise ValueError(f"Sequence length {T} > block_size {self.block_size}")
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: position i cannot attend to j > i
        mask = nn.Transformer.generate_square_subsequent_mask(T, device=idx.device)
        x = self.blocks(x, mask=mask, is_causal=True)
        # Logits: one vocab-sized vector per position
        return self.lm_head(x)
Line-by-line intuition:
| Layer | Role (maps to lessons) |
|---|---|
| tok_emb | Token ID → vector (like word embeddings, Module 4) |
| pos_emb | Position index → vector (positional encoding, Lesson 3) |
| TransformerEncoderLayer | Self-attention + FFN + residuals (we use causal mask for GPT behavior) |
| lm_head | Hidden state → scores over entire vocabulary (next-token logits) |
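The causal mask passed to the transformer blocks can be pictured as a T×T grid of additive scores: 0 where attention is allowed and negative infinity where it is blocked. The pure-Python sketch below reproduces that pattern for illustration; causal_mask is a hypothetical helper, not PyTorch's own implementation.

```python
def causal_mask(T: int) -> list[list[float]]:
    """Row i is the query position, column j is the key position."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(T)] for i in range(T)]

mask = causal_mask(4)
for row in mask:
    print(["0" if v == 0.0 else "-inf" for v in row])
```

Adding negative infinity to an attention score before the softmax drives that weight to zero, so position 0 sees only itself while the last position sees the whole window.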
Output shape: (batch, time, vocab_size) — for each position, a score per possible next token.
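Parameter count (rough): with vocab≈8000, d_model=128, layers=3, the model has roughly 2.7 million parameters. About 1 million sit in the token embedding and another 1 million in the output head, so a smaller vocabulary shrinks the model noticeably. Document your exact count with:
sum(p.numel() for p in model.parameters())
Step 5 — Training loop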
Goal: Minimize cross-entropy between predicted logits and true next tokens.
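As a reference point for the loss values you will see: a model that spreads probability evenly over the vocabulary has cross-entropy ln(vocab_size). The sketch below computes that baseline with Python's math module. The vocabulary size of 8002 assumes a full 8,000-word vocab plus the two reserved tokens; yours may be smaller.

```python
import math

def cross_entropy(probs: list[float], target: int) -> float:
    """Negative log-probability assigned to the correct next token."""
    return -math.log(probs[target])

vocab_size = 8002  # assumption: 8000 words + <pad> + <unk>
uniform = [1.0 / vocab_size] * vocab_size

baseline = cross_entropy(uniform, target=42)    # identical for any target
print(f"uniform-guess loss: {baseline:.2f}")
print(f"perplexity: {math.exp(baseline):.0f}")  # exp(loss) = effective number of choices
```

A loss in this neighbourhood at the very start of training is expected; a trained model should end clearly below it.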
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyGPT(vocab_size=len(vocab), block_size=128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()

def evaluate(loader):
    model.eval()
    total_loss = 0.0
    n = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            # Flatten (B,T,C) and (B,T) for cross-entropy
            loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
            total_loss += loss.item() * x.size(0)
            n += x.size(0)
    model.train()
    return total_loss / max(n, 1)

# val_loader: a DataLoader over NextTokenDataset(val_ids, block_size=128), built like train_loader
for epoch in range(10):
    model.train()
    running = 0.0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()
        # Stabilize training on small models / noisy grads
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        running += loss.item() * x.size(0)
    train_loss = running / len(train_ds)
    val_loss = evaluate(val_loader)
    print(f"epoch {epoch+1:02d} train_loss={train_loss:.4f} val_loss={val_loss:.4f}")
What to watch:
- Train loss should trend down — model is learning local patterns in your blog text.
- Val loss should follow; if train drops but val flatlines or rises → overfitting (Module 2) — stop earlier or shrink the model.
- Plot with matplotlib and save loss_curve.png for your README.
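The "stop earlier" advice can be made mechanical with a patience counter: stop once validation loss has failed to improve for a few epochs in a row. A minimal sketch follows; best_epoch_with_patience is a hypothetical helper and the loss values are invented, not taken from a real run.

```python
def best_epoch_with_patience(val_losses: list[float], patience: int = 2) -> tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 0-indexed.

    best_epoch is where validation loss was lowest before stopping;
    stop_epoch is the last epoch that would actually be run.
    """
    best, best_epoch, bad = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: this is the point where a checkpoint would be saved.
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation curve: improves, then rises again (overfitting).
history = [6.1, 5.4, 5.0, 4.9, 5.0, 5.2, 5.5]
best_epoch, stop_epoch = best_epoch_with_patience(history, patience=2)
print(best_epoch, stop_epoch)
```

In the real loop the same bookkeeping lives inside the epoch loop, with the comparison made right after evaluate() returns.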
Save weights when val loss improves (torch.save does not create folders, so checkpoints/ must already exist):
torch.save(model.state_dict(), "checkpoints/tiny_gpt.pt")
Step 6 — Generate text (inference)
Goal: Autoregressive loop — append one token at a time (Lesson 4 / Module 7 preview).
import torch

@torch.no_grad()
def generate(
    model,
    prompt_ids: list[int],
    id_to_tok: dict,
    max_new: int = 40,
    temperature: float = 0.8,
):
    model.eval()
    ids = list(prompt_ids)
    for _ in range(max_new):
        # Only feed the last block_size tokens (context window limit)
        window = ids[-model.block_size :]
        x = torch.tensor([window], device=next(model.parameters()).device)
        logits = model(x)[0, -1]  # logits for LAST position only → next token
        if temperature <= 0:
            next_id = logits.argmax().item()
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1).item()
        ids.append(next_id)
        # Optional early stop at sentence end
        if id_to_tok[next_id] in (".", "!", "?"):
            break
    return decode(ids)

prompt = "on device ai"
out = generate(model, encode(prompt), id_to_tok, temperature=0.8)
print(out)
How generation maps to GPT chat:
- Encode prompt → token IDs.
- Model predicts distribution for next token.
- Sample (or argmax) one ID, append to sequence.
- Repeat. Each new token can attend only to earlier tokens, because the model applies the same causal mask at inference as in training.
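The four steps can be run without a trained network by swapping the model for a lookup table. In this sketch, toy_next_token is a hypothetical stand-in that always returns the follower listed in a hand-written bigram table, which corresponds to greedy decoding.

```python
# Hand-written bigram table standing in for the model (invented data).
BIGRAMS = {"on": "device", "device": "ai", "ai": "runs", "runs": "locally", "locally": "."}

def toy_next_token(context: list[str]) -> str:
    """Stand-in for the model's argmax prediction; looks only at the last token."""
    return BIGRAMS.get(context[-1], ".")

def generate(prompt: list[str], max_new: int = 10, block_size: int = 3) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new):
        window = tokens[-block_size:]   # context window limit
        nxt = toy_next_token(window)    # predict the next token
        tokens.append(nxt)              # append it to the sequence
        if nxt == ".":                  # optional early stop at sentence end
            break
    return tokens

print(" ".join(generate(["on"])))
```

The real model differs only in how the next token is chosen: it scores the whole vocabulary from the full window instead of looking up the last word.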
Temperature (Module 7 preview):
| Value | Behavior |
|---|---|
| 0 or very low | Greedy: same prompt → same continuation; safer, repetitive |
| 0.7–0.9 | Balanced sampling for demos |
| > 1.0 | More random: creative but often nonsense on tiny models |
Run the same prompt at temperature=0.5 and 1.2 — include both in your README.
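To see why temperature changes the output, the sketch below applies a temperature-scaled softmax to three made-up logits using only the standard library. softmax_with_temperature is a hypothetical helper that mirrors the torch.softmax(logits / temperature) step in generate.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # invented scores for three candidate tokens
for t in (0.5, 1.0, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution so unlikely tokens are sampled more often.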
Step 7 — Optional Next.js or CLI demo
Minimum viable demo: generate.py prints to terminal.
Stretch — local API:
- Save vocab.json + tiny_gpt.pt.
- Small Flask/FastAPI server: POST /generate with { "prompt": "..." }.
- Next.js page with textarea + button → fetch → show completion.
You do not need to embed PyTorch in Next.js for this course — a local Python server is enough.
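For a sense of how small the local server can be, here is a sketch of the POST /generate endpoint. It uses the standard library's http.server rather than Flask or FastAPI so that it runs with no extra dependencies. generate_text is a placeholder to be replaced by your Step 6 code, and the "completion" response field is an assumed name.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_text(prompt: str) -> str:
    """Placeholder: replace with a call to your generate() from Step 6."""
    return prompt + " ..."

class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/generate":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            prompt = str(payload["prompt"])
        except (json.JSONDecodeError, KeyError, TypeError):
            self.send_error(400, "expected JSON body with a 'prompt' field")
            return
        body = json.dumps({"completion": generate_text(prompt)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the terminal quiet

def make_server(port: int = 8000) -> HTTPServer:
    """Bind to localhost only; this is a local demo, not a public service."""
    return HTTPServer(("127.0.0.1", port), GenerateHandler)

# To run the server: make_server().serve_forever()
```

A browser page served from a different port, such as a Next.js dev server, will also need CORS headers on the response, which this sketch leaves out.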
Evaluation (honest expectations)
A small model on a small blog corpus is not ChatGPT. Success means:
| Criterion | What “good” looks like |
|---|---|
| Val loss | Decreases over epochs, then plateaus |
| Generations | Occasionally uses blog-like phrases (AI, mobile, vision) |
| Understanding | You can explain corpus → tokens → causal train → sample |
| Documentation | 5 sample outputs with temperature notes |
Example README table:
| Prompt | T=0.5 (snippet) | T=1.2 (snippet) | On-topic? |
|---|---|---|---|
| on device ai | … | … | yes / partial / no |
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Corpus build reports 0 characters | Wrong path to content/blog | Print root.resolve() and fix relative path |
| Loss nan | Learning rate too high | Try 1e-4, keep grad clip at 1.0 |
| Loss flat, random text | Corpus too small or vocab mostly <unk> | More posts; check what share of encode output is <unk> (ID 1) |
| CUDA OOM | Batch too large | Lower batch_size to 8 or 16 |
| Gibberish only | Too few epochs or tiny data | Train longer; accept limits of mini model |
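For the "vocab mostly <unk>" row, a quick diagnostic is the fraction of encoded IDs equal to the unknown token. A minimal sketch, where unk_rate is a hypothetical helper and the sample IDs are made up:

```python
def unk_rate(ids: list[int], unk_id: int = 1) -> float:
    """Fraction of token IDs that are the unknown token."""
    return ids.count(unk_id) / len(ids) if ids else 0.0

# Invented encode() output: three of the eight IDs are <unk>.
sample_ids = [5, 1, 9, 1, 12, 7, 1, 3]
print(f"{unk_rate(sample_ids):.1%} of tokens are <unk>")
```

The same check is useful on encoded prompts: a prompt made mostly of words the corpus never contained gives the model almost nothing to condition on.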
Deliverables
- data/blog_corpus.txt + data/vocab.json
- checkpoints/tiny_gpt.pt
- Training loss curve image
- 5 prompted generations at two temperatures
- README: context window size, parameter count, one paragraph on causal training
What's next
Module 6 complete. Continue to Module 7 — GenAI & LLMs when ready.
Return to the AI course curriculum anytime to track progress.