
Module 6 — Transformers (core of GenAI)

Project: mini transformer on blog text

Train a small causal transformer on KinetiqVision blog MDX, predict next tokens, and generate sample sentences from a prompt.

~240 min read + exercises

Project: mini transformer — next-word prediction on blog text

Before we begin

You will train a small decoder-only transformer to predict the next token — the same core training objective as GPT, at a scale that runs on a laptop.

Corpus: plain text from this site’s blog MDX files in content/blog/. Your own writing becomes the training data, so generations can echo your blog’s topics (AI, mobile, computer vision).

Figure: Blog → train → generate. Extract text → tokenize → train causal transformer → sample next tokens from a prompt.

How this connects to Module 6

| Lesson | What you use in this project |
| --- | --- |
| Attention / self-attention | Each layer mixes token vectors; causal mask blocks future tokens |
| Transformer architecture | Embedding + position + stacked blocks + linear head |
| Encoder vs decoder | You build decoder-only (GPT-style), not BERT |
| Tokenization | Word-level vocab here (BPE optional stretch goal) |

Training objective: given tokens [t₀, t₁, …, tₙ₋₁], predict [t₁, t₂, …, tₙ] — shift-by-one next-token prediction.
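
Written out as a loss, the shift-by-one objective is the average negative log-likelihood of each true next token. This is the standard cross-entropy form with the same indices as above; the symbol p_θ (the model's predicted distribution, with parameters θ) is notation introduced here for illustration, not taken from the lesson.

```latex
\mathcal{L}(\theta) = -\frac{1}{n} \sum_{i=0}^{n-1} \log p_\theta\left(t_{i+1} \mid t_0, \dots, t_i\right)
```

Each term only conditions on tokens up to position i, which is exactly what the causal mask enforces inside the model.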


What you will build

  1. Extract and clean text from blog MDX files.
  2. Build a word-level tokenizer + vocabulary.
  3. Train a tiny transformer (2–4 layers, small width).
  4. Plot training loss vs epochs.
  5. Generate continuations from prompts like "on device ai" or "computer vision".
  6. Optional: CLI or simple UI to type a prompt and print output.

Estimated time: 4–6 hours.


Before you start

  • Finish the Module 6 quiz.
  • Python 3.10+ and pip install torch matplotlib

Create a project folder next to your repo (or inside it):

text
mini-transformer/
  data/           # corpus + vocab (created by scripts)
  build_corpus.py
  train.py
  generate.py
  checkpoints/    # saved model weights

From mini-transformer/, paths to blog MDX are typically ../content/blog if your repo root is the parent folder.


Step 1 — Build a text corpus

Goal: Turn MDX blog posts into one plain-text file the model can learn from.

python
# build_corpus.py
from pathlib import Path
import re
 
# Path from mini-transformer/ → repo content/blog
root = Path(__file__).resolve().parent.parent / "content" / "blog"
out = Path(__file__).resolve().parent / "data"
out.mkdir(parents=True, exist_ok=True)
 
texts = []
for path in sorted(root.glob("*.mdx")):
    raw = path.read_text(encoding="utf-8")
 
    # Remove YAML frontmatter between first --- pair
    body = re.sub(r"^---.*?---\s*", "", raw, flags=re.S)
 
    # Strip images ![alt](url) entirely; for links [text](url), drop the URL but keep the readable text
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", " ", body)
    body = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", body)
 
    # Drop # headings markers but keep words; remove extra whitespace
    body = re.sub(r"#+\s*", "", body)
    body = re.sub(r"\s+", " ", body).strip()
 
    if body:
        texts.append(body)
 
corpus = "\n".join(texts).lower()
(out / "blog_corpus.txt").write_text(corpus, encoding="utf-8")
print(f"Wrote {len(corpus):,} characters from {len(texts)} posts")

What each part does:

  • Frontmatter regex: MDX files start with a --- metadata block; the model should not memorize dates and titles from YAML.
  • Link/image regex: removes URLs so they do not pollute the vocabulary; the surrounding article text stays.
  • .lower(): smaller vocab (AI and ai merge); fine for a learning project.
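
To see the cleaning rules in action before pointing them at real posts, here is a small self-contained sketch. The helper name `clean_mdx` and the sample post are made up for illustration; the regexes are the frontmatter, image, heading, and whitespace rules from the script above.

```python
import re

def clean_mdx(raw: str) -> str:
    """Apply the frontmatter, image, heading, and whitespace rules to one post."""
    body = re.sub(r"^---.*?---\s*", "", raw, flags=re.S)   # YAML frontmatter
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", " ", body)      # images
    body = re.sub(r"#+\s*", "", body)                      # heading markers
    return re.sub(r"\s+", " ", body).strip().lower()

# Hypothetical MDX post, not a real file from content/blog
sample = """---
title: "Hello"
date: 2024-01-01
---

# On-device AI

Models run ![phone](/img/phone.png) locally.
"""

cleaned = clean_mdx(sample)
print(cleaned)  # on-device ai models run locally.
```

One caveat worth checking on your own posts: `#+\s*` removes every `#`, not only heading markers, so a phrase like "C# is" comes out as "cis".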

Run: python build_corpus.py
Expect: thousands to tens of thousands of characters depending on how many blog posts exist. If the script prints "Wrote 0 characters from 0 posts", fix the path to content/blog.

Train / validation split (do this before training; never train on the validation tokens):

python
# inside train.py later, or a small split helper
split_at = int(len(all_ids) * 0.9)
train_ids = all_ids[:split_at]
val_ids = all_ids[split_at:]

Hold out 10% of token IDs so you can see if loss improves on unseen text, not just memorization.
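
On a small corpus the 10% slice can end up shorter than one context window, which would leave nothing to evaluate. A minimal sketch of a split helper that checks for this is below; `split_ids`, its parameters, and the toy ID list are hypothetical, and the "block_size + 2" threshold is a deliberately conservative assumption.

```python
def split_ids(all_ids: list[int], train_fraction: float = 0.9, block_size: int = 128):
    """Split one long token-ID list into contiguous train and validation parts."""
    split_at = int(len(all_ids) * train_fraction)
    train_ids, val_ids = all_ids[:split_at], all_ids[split_at:]

    # Assumption: require a little more than one full window (input + shifted target)
    min_len = block_size + 2
    if len(val_ids) < min_len:
        raise ValueError(
            f"validation split has {len(val_ids)} tokens; need at least {min_len} "
            f"for block_size={block_size}"
        )
    return train_ids, val_ids

toy_ids = list(range(1000))          # stand-in for the real all_ids
train_ids, val_ids = split_ids(toy_ids, block_size=16)
print(len(train_ids), len(val_ids))  # 900 100
```

With the same 1,000 tokens and the default `block_size=128`, the helper raises, which is a hint to add more posts or shrink the context window.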


Step 2 — Tokenizer (word-level)

Goal: Map text ↔ lists of integers the neural net can process.

python
# tokenizer.py (or top of train.py)
import re
import json
from collections import Counter
from pathlib import Path
 
text = Path("data/blog_corpus.txt").read_text(encoding="utf-8")
 
# Split into words and punctuation tokens
tokens = re.findall(r"[a-z0-9]+|[^\s]", text)
counts = Counter(tokens)
 
vocab = {"<pad>": 0, "<unk>": 1}  # reserve IDs for padding and unknown words
for tok, _ in counts.most_common(8000):
    vocab[tok] = len(vocab)
 
id_to_tok = {i: t for t, i in vocab.items()}
 
def encode(s: str) -> list[int]:
    pieces = re.findall(r"[a-z0-9]+|[^\s]", s.lower())
    return [vocab.get(t, vocab["<unk>"]) for t in pieces]
 
def decode(ids: list[int]) -> str:
    return " ".join(id_to_tok[i] for i in ids)
 
Path("data/vocab.json").write_text(json.dumps(vocab), encoding="utf-8")
print("vocab size:", len(vocab))
all_ids = encode(text)

What each part does:

| Piece | Meaning |
| --- | --- |
| re.findall(...) | "on-device" → ["on", "-", "device"]; simple but works for learning |
| <unk> | Words not in the top 8k map to ID 1, so rare words won’t crash training |
| encode / decode | Same interface real tokenizers use (string ↔ ID list) |
| all_ids | Entire corpus as one long integer sequence for the dataset |
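
A quick round trip makes the `<unk>` behavior concrete. The sketch below reuses the same split rule and encode/decode logic on a made-up two-sentence corpus with a tiny vocabulary cap (8 instead of 8000), so some words are guaranteed to fall outside the vocab.

```python
import re
from collections import Counter

PATTERN = r"[a-z0-9]+|[^\s]"  # same split rule as the tokenizer above

# Made-up corpus standing in for data/blog_corpus.txt
corpus = "on-device ai runs on the phone. on-device ai vision runs on the phone too."
counts = Counter(re.findall(PATTERN, corpus))

vocab = {"<pad>": 0, "<unk>": 1}
for tok, _ in counts.most_common(8):  # tiny cap: only tokens seen at least twice survive
    vocab[tok] = len(vocab)
id_to_tok = {i: t for t, i in vocab.items()}

def encode(s: str) -> list[int]:
    return [vocab.get(t, vocab["<unk>"]) for t in re.findall(PATTERN, s.lower())]

def decode(ids: list[int]) -> str:
    return " ".join(id_to_tok[i] for i in ids)

ids = encode("On-device AI runs on a tablet.")
round_trip = decode(ids)
print(round_trip)  # on - device ai runs on <unk> <unk> .
```

Note that decoding does not restore the original spacing or casing ("on - device", not "On-device"); that is a known limit of this simple word-level scheme.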

Stretch: swap in tiktoken (Lesson 5) for subword tokens — better for rare words and punctuation.


Step 3 — Next-token dataset

Goal: PyTorch Dataset that returns (input, target) pairs where target is input shifted by one position.

python
import torch
from torch.utils.data import Dataset
 
class NextTokenDataset(Dataset):
    def __init__(self, ids: list[int], block_size: int = 128):
        self.ids = ids
        self.block_size = block_size
 
    def __len__(self):
        # One sample per starting position where a full window of block_size + 1 tokens fits
        return max(0, len(self.ids) - self.block_size)
 
    def __getitem__(self, i):
        chunk = self.ids[i : i + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)   # input:  positions 0..T-1
        y = torch.tensor(chunk[1:], dtype=torch.long)    # target: positions 1..T
        return x, y

Example with block_size=4 and chunk [10, 20, 30, 40, 50]:

| Position | Input x | Target y (next token) |
| --- | --- | --- |
| 0 | 10 | 20 |
| 1 | 20 | 30 |
| 2 | 30 | 40 |
| 3 | 40 | 50 |

The model sees up to block_size tokens — your context window (Lesson 5). Longer posts are learned in overlapping windows, not all at once.

DataLoader batches many windows:

python
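from torch.utils.data import DataLoader

train_ds = NextTokenDataset(train_ids, block_size=128)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

# Validation windows are only evaluated (Step 5), so no shuffling is needed
val_ds = NextTokenDataset(val_ids, block_size=128)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False)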

shuffle=True randomizes which windows appear each epoch — better generalization on small corpora.


Step 4 — Tiny GPT model

Goal: Decoder-only stack — token + position embeddings, causal transformer layers, linear head to vocab.

python
import torch
import torch.nn as nn
 
class TinyGPT(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model: int = 128,
        nhead: int = 4,
        num_layers: int = 3,
        block_size: int = 128,
    ):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
 
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=512,
            batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
 
    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        if T > self.block_size:
            raise ValueError(f"Sequence length {T} > block_size {self.block_size}")
 
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
 
        # Causal mask: position i cannot attend to j > i
        mask = nn.Transformer.generate_square_subsequent_mask(T, device=idx.device)
        x = self.blocks(x, mask=mask, is_causal=True)
 
        # Logits: one vocab-sized vector per position
        return self.lm_head(x)

Line-by-line intuition:

| Layer | Role (maps to lessons) |
| --- | --- |
| tok_emb | Token ID → vector (like word embeddings, Module 4) |
| pos_emb | Position index → vector (positional encoding, Lesson 3) |
| TransformerEncoderLayer | Self-attention + FFN + residuals (we use causal mask for GPT behavior) |
| lm_head | Hidden state → scores over entire vocabulary (next-token logits) |

Output shape: (batch, time, vocab_size) — for each position, a score per possible next token.
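
To build intuition for what the causal mask does to attention weights, here is a dependency-free sketch. It follows the common convention of an additive float mask (0 where attention is allowed, -inf where it is blocked); the helper names and the uniform scores are made up for illustration and are not part of the model code.

```python
import math

def causal_mask(T: int) -> list[list[float]]:
    # Row i is the query position, column j the key position.
    # 0.0 where j <= i (past or self), -inf where j > i (future).
    return [[0.0 if j <= i else float("-inf") for j in range(T)] for i in range(T)]

def softmax(row: list[float]) -> list[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) is 0.0, so blocked slots vanish
    total = sum(exps)
    return [e / total for e in exps]

T = 4
scores = [[1.0] * T for _ in range(T)]     # made-up raw attention scores
mask = causal_mask(T)
weights = [
    softmax([s + m for s, m in zip(score_row, mask_row)])
    for score_row, mask_row in zip(scores, mask)
]

for row in weights:
    print([round(w, 2) for w in row])
```

Position 0 puts all of its weight on itself, position 1 splits weight between positions 0 and 1, and so on; no row ever assigns weight to a later position.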

Parameter count (rough): with vocab≈8000, d_model=128, layers=3 → on the order of 1–3 million parameters. Document your exact count with:

python
sum(p.numel() for p in model.parameters())
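
As a cross-check on that number, the count can also be estimated by hand. The sketch below assumes the default parameter layout of nn.TransformerEncoderLayer (fused q/k/v projection, biases enabled, two LayerNorms per layer) and a vocabulary of 8002 entries (8000 words plus the two reserved IDs); the function name is hypothetical, and your real vocab may be smaller.

```python
def tiny_gpt_param_count(
    vocab_size: int,
    d_model: int = 128,
    num_layers: int = 3,
    block_size: int = 128,
    dim_feedforward: int = 512,
) -> int:
    tok_emb = vocab_size * d_model
    pos_emb = block_size * d_model

    attn = 3 * d_model * d_model + 3 * d_model   # fused q/k/v weights + biases
    attn += d_model * d_model + d_model          # attention output projection
    ffn = d_model * dim_feedforward + dim_feedforward   # first linear layer
    ffn += dim_feedforward * d_model + d_model          # second linear layer
    norms = 2 * (2 * d_model)                    # two LayerNorms, weight + bias each
    per_layer = attn + ffn + norms

    lm_head = d_model * vocab_size + vocab_size  # weights + bias
    return tok_emb + pos_emb + num_layers * per_layer + lm_head

total = tiny_gpt_param_count(vocab_size=8002)
print(f"{total:,}")  # 2,667,714
```

Roughly three quarters of these parameters sit in the token embedding and the output head, which is typical when the vocabulary is large relative to d_model.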

Step 5 — Training loop

Goal: Minimize cross-entropy between predicted logits and true next tokens.

python
import torch
from torch.utils.data import DataLoader
 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyGPT(vocab_size=len(vocab), block_size=128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()
 
def evaluate(loader):
    model.eval()
    total_loss = 0.0
    n = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            # Flatten (B,T,C) and (B,T) for cross-entropy
            loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
            total_loss += loss.item() * x.size(0)
            n += x.size(0)
    model.train()
    return total_loss / max(n, 1)
 
for epoch in range(10):
    model.train()
    running = 0.0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
 
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()
 
        # Stabilize training on small models / noisy grads
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
 
        running += loss.item() * x.size(0)
 
    train_loss = running / len(train_ds)
    val_loss = evaluate(val_loader)
    print(f"epoch {epoch+1:02d}  train_loss={train_loss:.4f}  val_loss={val_loss:.4f}")

What to watch:

  • Train loss should trend down — model is learning local patterns in your blog text.
  • Val loss should follow; if train drops but val flatlines or rises → overfitting (Module 2) — stop earlier or shrink the model.
  • Plot with matplotlib and save loss_curve.png for your README.
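
A reference point helps when reading these loss values. Cross-entropy here uses natural logarithms, so a model that guesses uniformly over the vocabulary scores ln(vocab_size), and exp(loss) gives perplexity. The sketch below assumes a vocabulary of 8002 entries; the function names are made up for illustration.

```python
import math

def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of guessing every token with equal probability."""
    return math.log(vocab_size)

def perplexity(loss: float) -> float:
    """Effective number of equally likely choices implied by a loss value."""
    return math.exp(loss)

baseline = uniform_baseline_loss(8002)
print(f"{baseline:.2f}")  # 8.99
```

If the first epochs start near this value and then drop well below it, the model is learning something; a loss that stays near the baseline suggests a data or wiring problem.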

Save weights when val loss improves:

python
torch.save(model.state_dict(), "checkpoints/tiny_gpt.pt")

Step 6 — Generate text (inference)

Goal: Autoregressive loop — append one token at a time (Lesson 4 / Module 7 preview).

python
import torch
 
@torch.no_grad()
def generate(
    model,
    prompt_ids: list[int],
    id_to_tok: dict,
    max_new: int = 40,
    temperature: float = 0.8,
):
    model.eval()
    ids = list(prompt_ids)
 
    for _ in range(max_new):
        # Only feed the last block_size tokens (context window limit)
        window = ids[-model.block_size :]
        x = torch.tensor([window], device=next(model.parameters()).device)
 
        logits = model(x)[0, -1]  # logits for LAST position only → next token
 
        if temperature <= 0:
            next_id = logits.argmax().item()
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1).item()
 
        ids.append(next_id)
 
        # Optional early stop at sentence end
        if id_to_tok[next_id] in (".", "!", "?"):
            break
 
    return decode(ids)
 
prompt = "on device ai"
out = generate(model, encode(prompt), id_to_tok, temperature=0.8)
print(out)

How generation maps to GPT chat:

  1. Encode prompt → token IDs.
  2. Model predicts distribution for next token.
  3. Sample (or argmax) one ID, append to sequence.
  4. Repeat: each new token can attend only to past tokens, because the forward pass applies the same causal mask during generation as during training.

Temperature (Module 7 preview):

| Value | Behavior |
| --- | --- |
| 0 or very low | Greedy: same prompt → same continuation; safer, repetitive |
| 0.7–0.9 | Balanced sampling for demos |
| > 1.0 | More random: creative but often nonsense on tiny models |

Run the same prompt at temperature=0.5 and 1.2 — include both in your README.
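
The effect of temperature is easy to see on a toy distribution. This sketch applies the same logits / temperature scaling as the generation loop, in plain Python; the three logits are made up and `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
probs_by_temp = {t: softmax_with_temperature(logits, t) for t in (0.5, 1.0, 1.2)}

for t, probs in probs_by_temp.items():
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher ones flatten the distribution, so rarer tokens get sampled more often.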


Step 7 — Optional Next.js or CLI demo

Minimum viable demo: generate.py prints to terminal.

Stretch — local API:

  1. Save vocab.json + tiny_gpt.pt.
  2. Small Flask/FastAPI server: POST /generate with { "prompt": "..." }.
  3. Next.js page with textarea + button → fetch → show completion.

You do not need to embed PyTorch in Next.js for this course — a local Python server is enough.
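
Whichever server framework you pick, the request handling behind POST /generate is small. Below is a framework-agnostic sketch using only the standard library; `handle_generate`, `fake_generate`, the `completion` response key, and the prompt length cap are all hypothetical choices (the lesson only fixes the `prompt` request field).

```python
import json

def handle_generate(body: bytes, generate_fn, max_prompt_chars: int = 500) -> tuple[int, str]:
    """Validate a POST /generate body and return (HTTP status, JSON response text)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, json.dumps({"error": "body must be valid JSON"})

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return 400, json.dumps({"error": "'prompt' must be a non-empty string"})

    completion = generate_fn(prompt[:max_prompt_chars])
    return 200, json.dumps({"completion": completion})

def fake_generate(prompt: str) -> str:
    # Stand-in for: generate(model, encode(prompt), id_to_tok, ...)
    return prompt.lower() + " ..."

status, text = handle_generate(b'{"prompt": "On device AI"}', fake_generate)
print(status, text)
```

Keeping validation in a plain function like this makes it easy to test without starting a server, and the Flask or FastAPI route then only has to pass the request body through.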


Evaluation (honest expectations)

A small model on a small blog corpus is not ChatGPT. Success means:

| Criterion | What “good” looks like |
| --- | --- |
| Val loss | Decreases over epochs, then plateaus |
| Generations | Occasionally uses blog-like phrases (AI, mobile, vision) |
| Understanding | You can explain corpus → tokens → causal train → sample |
| Documentation | 5 sample outputs with temperature notes |

Example README table:

| Prompt | T=0.5 (snippet) | T=1.2 (snippet) | On-topic? |
| --- | --- | --- | --- |
| on device ai | | | yes / partial / no |

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| "Wrote 0 characters from 0 posts" after corpus build | Wrong path to content/blog | Print root.resolve() and fix relative path |
| Loss is NaN | Learning rate too high | Try 1e-4, keep grad clip at 1.0 |
| Loss flat, random text | Corpus too small or vocab mostly <unk> | More posts; check encode output length |
| CUDA OOM | Batch too large | Lower batch_size to 8 or 16 |
| Gibberish only | Too few epochs or tiny data | Train longer; accept limits of mini model |
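
For the "vocab mostly <unk>" case, a one-line measurement is more reliable than eyeballing the IDs. This is a minimal sketch assuming `<unk>` has ID 1 as in Step 2; `unk_rate` and the sample IDs are made up for illustration.

```python
def unk_rate(ids: list[int], unk_id: int = 1) -> float:
    """Fraction of tokens in `ids` that were mapped to the unknown-word ID."""
    if not ids:
        return 0.0
    return sum(1 for i in ids if i == unk_id) / len(ids)

sample_ids = [2, 1, 3, 1, 5, 6, 7, 8]   # hypothetical output of encode(...)
rate = unk_rate(sample_ids)
print(f"{rate:.1%}")  # 25.0%
```

Run it on `all_ids` and on a few encoded prompts: a high rate on prompts means the words you are typing never appeared in the corpus, so the model has nothing to condition on.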

Deliverables

  • data/blog_corpus.txt + data/vocab.json
  • checkpoints/tiny_gpt.pt
  • Training loss curve image
  • 5 prompted generations at two temperatures
  • README: context window size, parameter count, one paragraph on causal training

What's next

Module 6 complete. Continue to Module 7 — GenAI & LLMs when ready.
