Project: mini transformer — next-word prediction on blog text

Before we begin

You will train a small decoder-only transformer to predict the next token — the same core training objective as GPT, at a scale that runs on a laptop.

Corpus: plain text from this site’s blog MDX files in content/blog/. Your own writing becomes the training data, so generations can echo your blog’s topics (AI, mobile, computer vision).

Figure: Blog → train → generate. Extract text → tokenize → train causal transformer → sample next tokens from a prompt.

How this connects to Module 6

Lesson	What you use in this project
Attention / self-attention	Each layer mixes token vectors; causal mask blocks future tokens
Transformer architecture	Embedding + position + stacked blocks + linear head
Encoder vs decoder	You build decoder-only (GPT-style), not BERT
Tokenization	Word-level vocab here (BPE optional stretch goal)

Training objective: given tokens [t₀, t₁, …, tₙ₋₁], predict [t₁, t₂, …, tₙ] — shift-by-one next-token prediction.
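
The shift-by-one objective can be seen on a toy sentence. This is a minimal sketch in plain Python; `shift_by_one` and the example words are illustrative names, not part of the project scripts.

```python
def shift_by_one(tokens):
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form one training pair")
    return tokens[:-1], tokens[1:]

tokens = ["on", "device", "ai", "is", "fast"]
inputs, targets = shift_by_one(tokens)

# Each input position is trained to predict the token right after it.
for x, y in zip(inputs, targets):
    print(f"{x!r:>10} -> {y!r}")
```

Every position in a sequence produces one training signal, which is why a single window of text yields many prediction examples at once.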

What you will build

Extract and clean text from blog MDX files.
Build a word-level tokenizer + vocabulary.
Train a tiny transformer (2–4 layers, small width).
Plot training loss vs epochs.
Generate continuations from prompts like "on device ai" or "computer vision".
Optional: CLI or simple UI to type a prompt and print output.

Estimated time: 4–6 hours.

Before you start

Finish the Module 6 quiz.
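Python 3.10+ with PyTorch and matplotlib installed: pip install torch matplotlib. The Step 4 model passes is_causal to nn.TransformerEncoder, so a PyTorch 2.x release is assumed.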

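Create a project folder inside your repo root (build_corpus.py assumes this location; if you keep the folder next to the repo instead, adjust the content/blog path in that script):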

text

mini-transformer/
  data/           # corpus + vocab (created by scripts)
  build_corpus.py
  train.py
  generate.py
  checkpoints/    # saved model weights

From mini-transformer/, paths to blog MDX are typically ../content/blog if your repo root is the parent folder.

Step 1 — Build a text corpus

Goal: Turn MDX blog posts into one plain-text file the model can learn from.

python

# build_corpus.py
from pathlib import Path
import re
 
# Path from mini-transformer/ → repo content/blog
root = Path(__file__).resolve().parent.parent / "content" / "blog"
out = Path(__file__).resolve().parent / "data"
out.mkdir(parents=True, exist_ok=True)
 
texts = []
for path in sorted(root.glob("*.mdx")):
    raw = path.read_text(encoding="utf-8")
 
    # Remove YAML frontmatter between first --- pair
    body = re.sub(r"^---.*?---\s*", "", raw, flags=re.S)
 
    # Strip images ![alt](url) and links [text](url) entirely; alt text and link text are dropped too
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", " ", body)
    body = re.sub(r"\[[^\]]+\]\([^)]+\)", " ", body)
 
    # Drop heading markers (#, ##, ...) but keep the words; collapse whitespace runs
    body = re.sub(r"#+\s*", "", body)
    body = re.sub(r"\s+", " ", body).strip()
 
    if body:
        texts.append(body)
 
corpus = "\n".join(texts).lower()
(out / "blog_corpus.txt").write_text(corpus, encoding="utf-8")
print(f"Wrote {len(corpus):,} characters from {len(texts)} posts")

What each part does:

Frontmatter regex: MDX starts with --- metadata; the model should not memorize dates and titles from YAML.
Link/image regex: removes each image or link whole (alt text, link text, and URL); the surrounding article text stays.
.lower(): smaller vocab (AI and ai merge); fine for a learning project.

Run: python build_corpus.py
Expect: thousands to tens of thousands of characters depending on how many blog posts exist. If the script reports 0 characters from 0 posts, fix the path to content/blog.
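
To check what the cleaning steps do before running them on every post, the same regexes can be applied to one in-memory string. This is a sketch: `clean_mdx` is a hypothetical helper that wraps the steps from build_corpus.py, and the sample post is made up.

```python
import re

def clean_mdx(raw: str) -> str:
    """Apply the regex steps from build_corpus.py to a single MDX string."""
    body = re.sub(r"^---.*?---\s*", "", raw, flags=re.S)   # YAML frontmatter
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", " ", body)      # images
    body = re.sub(r"\[[^\]]+\]\([^)]+\)", " ", body)       # links (link text goes too)
    body = re.sub(r"#+\s*", "", body)                      # heading markers
    return re.sub(r"\s+", " ", body).strip().lower()       # collapse whitespace

# Made-up post used only for this check.
sample = """---
title: Hello
date: 2024-01-01
---
# On-Device AI

Read the [docs](https://example.com) first.
![diagram](/img/a.png)
Models run locally.
"""

print(clean_mdx(sample))
```

The word "docs" disappears along with its URL. Fenced code, JSX components, and import lines in real MDX files are not touched by these regexes, so some of that text may end up in the corpus.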

Train / validation split (do this before training — never tune on validation):

python

# inside train.py, after all_ids = encode(text) from Step 2 (or in a small split helper)
split_at = int(len(all_ids) * 0.9)
train_ids = all_ids[:split_at]
val_ids = all_ids[split_at:]

Hold out 10% of token IDs so you can see if loss improves on unseen text, not just memorization.
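
The split can also be wrapped in a small function and checked on stand-in data. This is a sketch under assumptions: `split_ids` and the demo list are hypothetical, and the block size of 128 is the default used later in this project.

```python
def split_ids(ids, train_frac=0.9):
    """Split a token-ID sequence into contiguous train and validation parts."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")
    split_at = int(len(ids) * train_frac)
    return ids[:split_at], ids[split_at:]

all_ids_demo = list(range(1000))            # stand-in for the real all_ids
train_demo, val_demo = split_ids(all_ids_demo)
print(len(train_demo), len(val_demo))

# The validation part must be longer than block_size to hold one full window.
block_size = 128
val_has_window = len(val_demo) > block_size
print("validation part holds a full window:", val_has_window)
```

With only 1,000 tokens the validation part is too short for a 128-token window, which is a common reason for an empty validation set on very small corpora.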

Step 2 — Tokenizer (word-level)

Goal: Map text ↔ lists of integers the neural net can process.

python

# tokenizer.py (or top of train.py)
import re
import json
from collections import Counter
from pathlib import Path
 
text = Path("data/blog_corpus.txt").read_text(encoding="utf-8")
 
# Split into words and punctuation tokens
tokens = re.findall(r"[a-z0-9]+|[^\s]", text)
counts = Counter(tokens)
 
vocab = {"<pad>": 0, "<unk>": 1}  # reserve IDs for padding and unknown words
for tok, _ in counts.most_common(8000):
    vocab[tok] = len(vocab)
 
id_to_tok = {i: t for t, i in vocab.items()}
 
def encode(s: str) -> list[int]:
    pieces = re.findall(r"[a-z0-9]+|[^\s]", s.lower())
    return [vocab.get(t, vocab["<unk>"]) for t in pieces]
 
def decode(ids: list[int]) -> str:
    return " ".join(id_to_tok[i] for i in ids)
 
Path("data/vocab.json").write_text(json.dumps(vocab), encoding="utf-8")
print("vocab size:", len(vocab))
all_ids = encode(text)

What each part does:

Piece	Meaning
`re.findall(...)`	`"on-device"` → `["on", "-", "device"]` — simple but works for learning
`<unk>`	Words not in top 8k map to ID 1 — rare words won’t crash training
`encode` / `decode`	Same interface real tokenizers use (string ↔ ID list)
`all_ids`	Entire corpus as one long integer sequence for the dataset
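
A round trip on a toy corpus shows how the split pattern and `<unk>` behave. This is a self-contained sketch: the toy text and the `toy_` names are made up for illustration and mirror the tokenizer above at a much smaller size.

```python
import re
from collections import Counter

PATTERN = r"[a-z0-9]+|[^\s]"   # same split rule as the tokenizer above

def build_vocab(text, max_words):
    """Reserve IDs 0 and 1, then assign IDs by descending frequency."""
    counts = Counter(re.findall(PATTERN, text))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, _ in counts.most_common(max_words):
        vocab[tok] = len(vocab)
    return vocab

toy_text = "on-device ai runs on the phone. on-device ai is fast."
toy_vocab = build_vocab(toy_text, max_words=50)
toy_id_to_tok = {i: t for t, i in toy_vocab.items()}

def toy_encode(s):
    return [toy_vocab.get(t, toy_vocab["<unk>"]) for t in re.findall(PATTERN, s.lower())]

def toy_decode(ids):
    return " ".join(toy_id_to_tok[i] for i in ids)

ids = toy_encode("On-device vision is fast.")
print(ids)
print(toy_decode(ids))   # "vision" was never seen, so it comes back as <unk>
```

Decoding is lossy: case is gone, unseen words become `<unk>`, and punctuation is separated by spaces. That is acceptable for this project but worth noting in your README.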

Stretch: swap in tiktoken (Lesson 5) for subword tokens — better for rare words and punctuation.

Step 3 — Next-token dataset

Goal: PyTorch Dataset that returns (input, target) pairs where target is input shifted by one position.

python

import torch
from torch.utils.data import Dataset
 
class NextTokenDataset(Dataset):
    def __init__(self, ids: list[int], block_size: int = 128):
        self.ids = ids
        self.block_size = block_size
 
    def __len__(self):
        # One sample per start index i where ids[i : i + block_size + 1] fits
        return max(0, len(self.ids) - self.block_size)
 
    def __getitem__(self, i):
        chunk = self.ids[i : i + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)   # input:  positions 0..T-1
        y = torch.tensor(chunk[1:], dtype=torch.long)    # target: positions 1..T
        return x, y

Example with block_size=4 and chunk [10, 20, 30, 40, 50]:

Position	Input `x`	Target `y` (next token)
0	10	20
1	20	30
2	30	40
3	40	50

The model sees up to block_size tokens — your context window (Lesson 5). Longer posts are learned in overlapping windows, not all at once.
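
The overlapping windows can be listed without PyTorch. This is a dependency-free sketch of the windowing idea; `windows` and the toy IDs are illustrative and not part of the project scripts.

```python
def windows(ids, block_size):
    """Yield (x, y) for every start index where block_size + 1 tokens fit."""
    for i in range(len(ids) - block_size):
        chunk = ids[i : i + block_size + 1]
        yield chunk[:-1], chunk[1:]

toy_ids = [10, 20, 30, 40, 50, 60, 70]
pairs = list(windows(toy_ids, block_size=4))

# Consecutive windows differ by a single token at each end.
for x, y in pairs:
    print(x, "->", y)
```

Seven tokens with a block size of 4 give three windows. Because neighbouring windows share almost all of their tokens, the number of samples is close to the number of tokens in the corpus.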

DataLoader batches many windows:

python

from torch.utils.data import DataLoader
 
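train_ds = NextTokenDataset(train_ids, block_size=128)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

# Validation windows come from the held-out 10%; evaluation needs no shuffling
val_ds = NextTokenDataset(val_ids, block_size=128)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False)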

shuffle=True reorders the windows every epoch, so each batch mixes text from different parts of the corpus instead of 32 nearly identical overlapping windows.

Step 4 — Tiny GPT model

Goal: Decoder-only stack — token + position embeddings, causal transformer layers, linear head to vocab.

python

import torch
import torch.nn as nn
 
class TinyGPT(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model: int = 128,
        nhead: int = 4,
        num_layers: int = 3,
        block_size: int = 128,
    ):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
 
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=512,
            batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
 
    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        if T > self.block_size:
            raise ValueError(f"Sequence length {T} > block_size {self.block_size}")
 
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
 
        # Causal mask: position i cannot attend to j > i
        mask = nn.Transformer.generate_square_subsequent_mask(T, device=idx.device)
        x = self.blocks(x, mask=mask, is_causal=True)
 
        # Logits: one vocab-sized vector per position
        return self.lm_head(x)

Line-by-line intuition:

Layer	Role (maps to lessons)
`tok_emb`	Token ID → vector (like word embeddings, Module 4)
`pos_emb`	Position index → vector (positional encoding, Lesson 3)
`TransformerEncoderLayer`	Self-attention + FFN + residuals (we use causal mask for GPT behavior)
`lm_head`	Hidden state → scores over entire vocabulary (next-token logits)

Output shape: (batch, time, vocab_size) — for each position, a score per possible next token.
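
The effect of the causal mask can be shown with a few lines of plain Python. This sketch assumes the additive-mask convention (0.0 where attention is allowed, negative infinity where it is blocked); `causal_mask` and `masked_softmax` are illustrative helpers, not PyTorch functions.

```python
import math

def causal_mask(T):
    """T x T additive mask: 0.0 where j <= i, -inf where j > i (the future)."""
    return [[0.0 if j <= i else float("-inf") for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores after adding the mask."""
    masked = [s + m for s, m in zip(scores, mask_row)]
    peak = max(masked)                           # the diagonal entry is always finite
    exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
uniform_scores = [1.0, 1.0, 1.0, 1.0]            # made-up, equal attention scores
rows = [masked_softmax(uniform_scores, row) for row in mask]
for i, weights in enumerate(rows):
    print(i, [round(w, 2) for w in weights])
```

Position 0 can only attend to itself, position 1 splits its weight over two tokens, and so on. Future positions always receive zero weight, which is what makes the encoder stack behave like a decoder.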

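Parameter count (rough): with the full vocabulary of 8,002 entries, d_model=128, and 3 layers, the model has about 2.7 million parameters, roughly 1 million each in the token embedding and the output head. A smaller corpus gives a smaller vocabulary and a smaller count. Document your exact count with: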

python

sum(p.numel() for p in model.parameters())
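
The count can also be worked out by hand, which shows where the parameters sit. This sketch assumes the default layout of nn.TransformerEncoderLayer (fused q/k/v projection, biases on every linear layer, two LayerNorms per layer, no final norm); `tiny_gpt_param_count` is a hypothetical helper.

```python
def tiny_gpt_param_count(vocab_size, d_model=128, num_layers=3,
                         block_size=128, dim_feedforward=512):
    """Estimate TinyGPT parameters from its shapes (weights plus biases)."""
    tok_emb = vocab_size * d_model
    pos_emb = block_size * d_model
    # Attention: fused q/k/v projection plus output projection
    attn = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: d_model -> dim_feedforward -> d_model
    ffn = (d_model * dim_feedforward + dim_feedforward) \
        + (dim_feedforward * d_model + d_model)
    norms = 2 * (2 * d_model)                  # two LayerNorms, weight + bias each
    per_layer = attn + ffn + norms
    lm_head = d_model * vocab_size + vocab_size
    return tok_emb + pos_emb + num_layers * per_layer + lm_head

total = tiny_gpt_param_count(vocab_size=8002)
print(f"{total:,} parameters")
```

The number of heads does not change the count, since the heads split the same projection matrices. Compare the result with the value PyTorch reports for your trained model and note any difference in the README.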

Step 5 — Training loop

Goal: Minimize cross-entropy between predicted logits and true next tokens.

python

import torch
from torch.utils.data import DataLoader
 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyGPT(vocab_size=len(vocab), block_size=128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()
 
def evaluate(loader):
    model.eval()
    total_loss = 0.0
    n = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            # Flatten (B,T,C) and (B,T) for cross-entropy
            loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
            total_loss += loss.item() * x.size(0)
            n += x.size(0)
    model.train()
    return total_loss / max(n, 1)
 
for epoch in range(10):
    model.train()
    running = 0.0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
 
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()
 
        # Stabilize training on small models / noisy grads
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
 
        running += loss.item() * x.size(0)
 
    train_loss = running / len(train_ds)
    val_loss = evaluate(val_loader)
    print(f"epoch {epoch+1:02d}  train_loss={train_loss:.4f}  val_loss={val_loss:.4f}")

What to watch:

Train loss should trend down — model is learning local patterns in your blog text.
Val loss should follow; if train drops but val flatlines or rises → overfitting (Module 2) — stop earlier or shrink the model.
Plot with matplotlib and save loss_curve.png for your README.
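
A useful reference point for the loss values is the chance level. A model that spreads probability evenly over the vocabulary has a cross-entropy of ln(vocab_size), so a freshly initialized model usually starts near that value. The sketch below computes it in plain Python; `cross_entropy` is an illustrative helper and the vocabulary size of 8,002 assumes a full vocabulary.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]

vocab_size = 8002
uniform_logits = [0.0] * vocab_size          # every token equally likely
baseline = cross_entropy(uniform_logits, target=5)
print(f"chance-level loss: {baseline:.2f}")  # equals ln(vocab_size)

# Perplexity is exp(loss): the effective number of equally likely choices.
for loss in (baseline, 6.0, 5.0):
    print(f"loss {loss:.2f} -> perplexity {math.exp(loss):.0f}")
```

If the first epoch reports a loss far above this baseline, something is likely wrong with the targets or the learning rate. A loss that never drops below it means the model has learned nothing useful yet.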

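Save weights whenever val loss improves: keep the best val_loss seen so far and run the line below only when the new value is lower. The checkpoints/ folder must already exist.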

python

torch.save(model.state_dict(), "checkpoints/tiny_gpt.pt")
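
The "only when it improves" logic can be kept in a small tracker. This is a sketch under assumptions: `BestCheckpoint`, its `patience` setting, and the loss values are made up for illustration, and the early-stopping rule is one simple option rather than the course's prescribed method.

```python
class BestCheckpoint:
    """Remember the lowest validation loss and count epochs without improvement."""

    def __init__(self, patience: int = 3):
        self.best = float("inf")
        self.patience = patience
        self.bad_epochs = 0

    def update(self, val_loss: float) -> bool:
        """Return True when val_loss is a new best, meaning the caller should save."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.bad_epochs >= self.patience

tracker = BestCheckpoint(patience=2)
fake_val_losses = [6.1, 5.4, 5.5, 5.6, 5.2]   # illustrative numbers only
for epoch, val_loss in enumerate(fake_val_losses, start=1):
    if tracker.update(val_loss):
        print(f"epoch {epoch}: new best {val_loss}, save the checkpoint here")
    if tracker.should_stop:
        print(f"epoch {epoch}: no improvement for {tracker.patience} epochs, stopping")
        break
```

In the real loop, the torch.save call goes where the first print is. Stopping after a couple of flat epochs also addresses the overfitting case described above.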

Step 6 — Generate text (inference)

Goal: Autoregressive loop — append one token at a time (Lesson 4 / Module 7 preview).

python

import torch
 
@torch.no_grad()
def generate(
    model,
    prompt_ids: list[int],
    id_to_tok: dict,
    max_new: int = 40,
    temperature: float = 0.8,
):
    model.eval()
    ids = list(prompt_ids)
 
    for _ in range(max_new):
        # Only feed the last block_size tokens (context window limit)
        window = ids[-model.block_size :]
        x = torch.tensor([window], device=next(model.parameters()).device)
 
        logits = model(x)[0, -1]  # logits for LAST position only → next token
 
        if temperature <= 0:
            next_id = logits.argmax().item()
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1).item()
 
        ids.append(next_id)
 
        # Optional early stop at sentence end
        if id_to_tok[next_id] in (".", "!", "?"):
            break
 
    return decode(ids)
 
prompt = "on device ai"
out = generate(model, encode(prompt), id_to_tok, temperature=0.8)
print(out)

How generation maps to GPT chat:

Encode prompt → token IDs.
Model predicts distribution for next token.
Sample (or argmax) one ID, append to sequence.
Repeat. Each position attends only to earlier tokens, because the same causal mask is applied at training time and at generation time.

Temperature (Module 7 preview):

Value	Behavior
`0` or very low	Greedy — same prompt → same continuation; safer, repetitive
`0.7–0.9`	Balanced sampling for demos
`> 1.0`	More random — creative but often nonsense on tiny models

Run the same prompt at temperature=0.5 and 1.2 — include both in your README.
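
To see why temperature changes the output, the scaling step can be run on three made-up logits. This is a plain-Python sketch of the same division-then-softmax used in generate; `softmax_with_temperature` and the logits are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("use argmax instead for temperature <= 0")
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]            # made-up scores for three candidate tokens
results = {}
for t in (0.5, 1.0, 1.2):
    results[t] = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in results[t]])
```

A low temperature pushes most of the probability onto the top token, while a value above 1.0 flattens the distribution so that unlikely tokens are sampled more often.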

Step 7 — Optional Next.js or CLI demo

Minimum viable demo: generate.py prints to terminal.

Stretch — local API:

Save vocab.json + tiny_gpt.pt.
Small Flask/FastAPI server: POST /generate with { "prompt": "..." }.
Next.js page with textarea + button → fetch → show completion.

You do not need to embed PyTorch in Next.js for this course — a local Python server is enough.
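
One possible shape for the local server is sketched below. It swaps Flask/FastAPI for Python's built-in http.server so that it runs without extra installs. `fake_generate`, `handle_generate`, `GenerateHandler`, `serve`, and port 8000 are all hypothetical choices; replace `fake_generate` with your own encode, generate, decode call.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def fake_generate(prompt: str) -> str:
    """Placeholder for the real model call from generate.py."""
    return prompt + " ..."

def handle_generate(body: bytes, generate_fn=fake_generate):
    """Turn a raw JSON request body into (status_code, response_dict)."""
    try:
        payload = json.loads(body or b"{}")
    except ValueError:                      # covers malformed JSON and bad encodings
        return 400, {"error": "body must be valid JSON"}
    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return 400, {"error": "'prompt' must be a non-empty string"}
    return 200, {"prompt": prompt, "completion": generate_fn(prompt)}

class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/generate":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_generate(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

def serve(port: int = 8000):
    """Blocks until interrupted; call this from your own script's entry point."""
    HTTPServer(("127.0.0.1", port), GenerateHandler).serve_forever()

# The request logic can be exercised without starting the server:
print(handle_generate(b'{"prompt": "on device ai"}'))
```

Keeping the request handling in a plain function makes it easy to test before any network code is involved. A browser page served from a different port would also need CORS headers or a proxy route, which this sketch leaves out.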

Evaluation (honest expectations)

A small model on a small blog corpus is not ChatGPT. Success means:

Criterion	What “good” looks like
Val loss	Decreases over epochs, then plateaus
Generations	Occasionally uses blog-like phrases (AI, mobile, vision)
Understanding	You can explain corpus → tokens → causal train → sample
Documentation	5 sample outputs with temperature notes

Example README table:

Prompt	T=0.5 (snippet)	T=1.2 (snippet)	On-topic?
`on device ai`	…	…	yes / partial / no

Troubleshooting

Symptom	Likely cause	Fix
`Wrote 0 characters from 0 posts` after corpus build	Wrong path to `content/blog`	Print `root.resolve()` and fix relative path
Loss `nan`	Learning rate too high	Try `1e-4`, keep grad clip at 1.0
Loss flat, random text	Corpus too small or vocab mostly `<unk>`	More posts; check what share of `encode` output is ID 1 (`<unk>`)
CUDA OOM	Batch too large	Lower `batch_size` to 8 or 16
Gibberish only	Too few epochs or tiny data	Train longer; accept limits of mini model
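
For the `<unk>` check in the table above, a short helper is enough. This is a sketch: `unk_rate` is a hypothetical function and the sample IDs are made up; in the project you would pass it `all_ids` or `encode(prompt)`.

```python
def unk_rate(ids, unk_id=1):
    """Fraction of token IDs equal to the unknown-word ID (1 in this project)."""
    if not ids:
        return 0.0
    return sum(1 for i in ids if i == unk_id) / len(ids)

sample_ids = [5, 1, 9, 1, 12, 7, 1, 3]      # made-up output of encode(...)
rate = unk_rate(sample_ids)
print(f"{rate:.1%} of tokens are <unk>")
```

A high rate on the training corpus suggests the vocabulary cap is too low for your text. A high rate on a prompt means the model has never seen most of its words and will mostly guess.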

Deliverables

data/blog_corpus.txt + data/vocab.json
checkpoints/tiny_gpt.pt
Training loss curve image
5 prompted generations at two temperatures
README: context window size, parameter count, one paragraph on causal training

What's next

Module 6 complete. Continue to Module 7 — GenAI & LLMs when ready.

Return to the AI course curriculum anytime to track progress.