Transformer architecture
Before we begin
A transformer stacks identical blocks. Each block refines token representations using attention and a small feed-forward network.
Focus on data flow, not on deriving every equation.
Figure: Full transformer stack
What you will learn
- Trace data from token IDs to vocab logits.
- Name every part of a transformer block.
- Explain residual connections and layer normalization at a high level.
- Describe positional encoding and why it is needed.
Full stack (walk the diagram)
Read the diagram top to bottom:
| Step | Component | What it does |
|---|---|---|
| 1 | Token IDs | Integers from the tokenizer — one per word/subword |
| 2 | Token embedding | Lookup table: ID → dense vector |
| 3 | + Positional encoding | Adds an order signal; self-attention by itself ignores token order (it is permutation-equivariant) |
| 4 | Transformer block × L | Same sublayers repeated — depth builds richer context |
| 5 | Final layer norm | Stabilizes activations before the head (common in GPT-style models) |
| 6 | Linear head | Projects the hidden vector at the last position (or at every position) to one logit per vocabulary entry |
| 7 | Softmax / sample | Pick next token — greedy or temperature sampling (Module 7) |
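The dashed box in the diagram means “repeat this block L times”, e.g. L = 6 in a toy model, L = 80+ in large LLMs. Width (hidden size) and L together drive parameter count: the weights in one block grow roughly with the square of the hidden size, and the total grows linearly with L.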
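To make the table concrete, here is a minimal sketch of the data flow in plain Python. The sizes, the random weights, and the helper names (`embed`, `head_logits`, and so on) are assumptions chosen for illustration, the positional table is the learned variant, and step 4 (the L transformer blocks) is deliberately left out so that only the surrounding pipeline is visible.

```python
import math
import random

random.seed(0)  # reproducible toy weights

VOCAB_SIZE, D_MODEL, MAX_LEN = 10, 4, 8  # toy sizes, illustration only


def rand_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


tok_emb = rand_matrix(VOCAB_SIZE, D_MODEL)  # step 2: token embedding table
pos_emb = rand_matrix(MAX_LEN, D_MODEL)     # step 3: learned positional table
w_head = rand_matrix(D_MODEL, VOCAB_SIZE)   # step 6: linear head


def embed(token_ids):
    """Look up each token vector and add the vector for its position."""
    return [
        [t + p for t, p in zip(tok_emb[tid], pos_emb[pos])]
        for pos, tid in enumerate(token_ids)
    ]


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector across its features (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def head_logits(vec):
    """Project one hidden vector to one logit per vocabulary entry."""
    return [
        sum(vec[i] * w_head[i][j] for i in range(D_MODEL))
        for j in range(VOCAB_SIZE)
    ]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


token_ids = [3, 1, 7]                 # step 1: integers from the tokenizer
x = embed(token_ids)                  # steps 2-3: 3 positions, D_MODEL features each
# Step 4 would apply the L transformer blocks to x here; omitted in this sketch.
x = [layer_norm(v) for v in x]        # step 5: final layer norm
logits = head_logits(x[-1])           # step 6: last position only
probs = softmax(logits)               # step 7: distribution over the vocabulary
next_token = max(range(VOCAB_SIZE), key=lambda j: probs[j])  # greedy pick
print(len(x), len(x[0]), len(logits), next_token)
```

Note that the sequence keeps its shape (positions × hidden size) through every step until the head, which is what lets the same block be stacked L times.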
Inside one block (zoom in)
Figure: One transformer layer
Each block has two sublayers:
- Multi-head self-attention — every token gathers context from other tokens (Lesson 2).
- Position-wise feed-forward (MLP) — same two-layer network applied independently at each position.
After each sublayer: residual add, then layer norm (Post-LN, as in the original paper). Many modern models use Pre-LN instead, which applies layer norm to the sublayer input and then adds the residual: same parts, different order.
The yellow dashed arcs in the full diagram are residual (skip) connections: output = x + Sublayer(x).
The block keeps the old representation x and adds a correction to it. This gives gradients a direct path through the stack, which is critical for training deep stacks.
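The two orderings of "add" and "norm" can be sketched as follows. This is an illustrative sketch rather than any particular model's code: `toy_sublayer` is a hypothetical stand-in for attention or the MLP, and the layer norm omits its learned scale and shift.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add(a, b):
    """Element-wise residual add."""
    return [x + y for x, y in zip(a, b)]


def post_ln(x, sublayer):
    """Post-LN: LayerNorm(x + Sublayer(x))."""
    return layer_norm(add(x, sublayer(x)))


def pre_ln(x, sublayer):
    """Pre-LN: x + Sublayer(LayerNorm(x))."""
    return add(x, sublayer(layer_norm(x)))


def toy_sublayer(vec):
    """Hypothetical sublayer: a small correction proportional to its input."""
    return [0.1 * v for v in vec]


x = [1.0, 2.0, 4.0]
post_out = post_ln(x, toy_sublayer)
pre_out = pre_ln(x, toy_sublayer)
print(post_out)
print(pre_out)
```

In the Pre-LN form the input `x` reaches the output untouched by any normalization, so a sublayer that outputs zeros leaves the representation exactly as it was.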
Feed-forward sublayer
Applied independently at each token position — same MLP weights everywhere, but different inputs per position.
Typical pattern:
- Expand: hidden 512 → 2048
- Activation (e.g. GELU)
- Project back: 2048 → 512
Attention mixes tokens; FFN transforms each mixed vector — both are needed.
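The expand, activate, project-back pattern can be sketched in plain Python. The sizes below (4 → 16 → 4) are a scaled-down assumption that keeps the same 4× ratio as 512 → 2048, and the random weights stand in for trained ones.

```python
import math
import random

random.seed(0)  # reproducible toy weights

D_MODEL, D_FF = 4, 16  # toy sizes; same 4x expansion ratio as 512 -> 2048


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


w1, b1 = rand_matrix(D_MODEL, D_FF), [0.0] * D_FF      # expand
w2, b2 = rand_matrix(D_FF, D_MODEL), [0.0] * D_MODEL   # project back


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    return [
        sum(vec[i] * weight[i][j] for i in range(len(vec))) + bias[j]
        for j in range(len(bias))
    ]


def ffn(vec):
    """Feed-forward sublayer for ONE token vector."""
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)


def ffn_all_positions(seq):
    """Apply the same weights independently at every position."""
    return [ffn(vec) for vec in seq]


# Positions 0 and 2 hold the same vector, position 1 a different one.
seq = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
out = ffn_all_positions(seq)
print(out[0] == out[2], out[0] == out[1])
```

Because no information crosses positions inside `ffn`, identical inputs at different positions produce identical outputs. Any mixing between tokens has to come from the attention sublayer.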
Layer normalization
Stabilizes activations: for each token, normalize its feature vector to zero mean and unit variance, then apply a learned per-feature scale and shift. Without normalization, deep transformers are hard to train (activations explode or vanish).
You will see LayerNorm boxes labeled “Add & norm” in the block diagram.
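A minimal sketch of layer normalization for a single token vector is shown below. The parameter names `gamma` and `beta` for the learned scale and shift are a common convention assumed here, and the example inputs are made up.

```python
import math


def layer_norm(vec, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's features, then apply scale (gamma) and shift (beta)."""
    n = len(vec)
    mean = sum(vec) / n
    var = sum((v - mean) ** 2 for v in vec) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in vec]
    gamma = gamma if gamma is not None else [1.0] * n  # identity scale by default
    beta = beta if beta is not None else [0.0] * n     # zero shift by default
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]


# Two tokens whose features differ in scale by a factor of 100.
tokens = [[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]]
normed = [layer_norm(t) for t in tokens]  # each token is normalized on its own

for row in normed:
    print([round(v, 3) for v in row])
```

Each token is normalized using only its own features, so the statistics do not depend on the other tokens in the sequence or on the batch. After normalization the two tokens above end up with nearly identical values despite the 100× difference in input scale.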
Positional encoding
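Self-attention alone treats tokens as a bag: the score between two tokens depends only on their vectors, so reordering the input merely reorders the outputs and the pairwise scores stay the same.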
Fix: add a position signal (sinusoidal or learned) to each embedding so the model knows word order.
Modern LLMs use variants like RoPE (rotary position embedding), which serve the same purpose and often extrapolate better to sequences longer than those seen in training.
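As an illustration of the sinusoidal option, the sketch below follows the formulation from the original transformer paper: even feature indices use a sine, odd indices a cosine, with wavelengths that grow geometrically up to a base of 10000. The token vectors `cat` and `dog` are hypothetical embeddings invented for the example.

```python
import math


def sinusoidal_encoding(seq_len, d_model):
    """Return one position vector of length d_model per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):          # i is the even feature index
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))          # feature i
            row.append(math.cos(angle))          # feature i + 1
        table.append(row[:d_model])              # trim if d_model is odd
    return table


# Hypothetical embeddings: the same token appears at positions 0 and 2.
cat = [0.5, -0.2, 0.1, 0.9]
dog = [-0.3, 0.8, 0.4, 0.0]
sentence = [cat, dog, cat]

pe = sinusoidal_encoding(len(sentence), d_model=4)
with_pos = [
    [e + p for e, p in zip(tok, pos_vec)]
    for tok, pos_vec in zip(sentence, pe)
]

print(sentence[0] == sentence[2])   # True: identical before adding positions
print(with_pos[0] == with_pos[2])   # False: distinguishable afterwards
```

Before the position signal is added, the two occurrences of the same token are indistinguishable to attention. Afterwards they carry different vectors, which is what lets the model tell them apart by position.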
Parameter scale intuition
| Model size | Rough idea |
|---|---|
| Module 6 project | Thousands–millions of params — trains on a laptop |
| BERT-base | ~110M |
| GPT-2 | ~124M – 1.5B (the smallest size was first announced as 117M) |
| GPT-3+ | Billions (GPT-3: 175B) |
Same block diagram — wider embeddings, larger FFN, more layers L. Understanding one small stack maps directly to ChatGPT-scale systems.
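A rough parameter count can be sketched from the block structure alone. The formula below assumes a GPT-2-style layout: biases on every linear layer, two layer norms per block plus a final one, learned positional embeddings, and a linear head that shares its weights with the token embedding. The configuration values passed in are assumptions for illustration (the second call uses the commonly cited settings of the smallest GPT-2).

```python
def block_params(d_model, d_ff):
    """Parameters in one transformer block."""
    attention = 4 * d_model * d_model + 4 * d_model  # Q, K, V, output projections + biases
    ffn = 2 * d_model * d_ff + d_ff + d_model        # expand + project back, with biases
    layer_norms = 2 * 2 * d_model                    # two LayerNorms, scale + shift each
    return attention + ffn + layer_norms


def model_params(vocab_size, max_len, d_model, d_ff, n_layers):
    """Parameters in the full stack."""
    embeddings = vocab_size * d_model + max_len * d_model  # token + position tables
    final_norm = 2 * d_model
    # Assumption: the linear head reuses the token embedding matrix (weight tying),
    # so it contributes no parameters of its own.
    return embeddings + n_layers * block_params(d_model, d_ff) + final_norm


# Hypothetical laptop-scale model.
toy = model_params(vocab_size=1000, max_len=64, d_model=64, d_ff=256, n_layers=2)

# Assumed configuration of the smallest GPT-2.
gpt2_small = model_params(
    vocab_size=50257, max_len=1024, d_model=768, d_ff=3072, n_layers=12
)

print(f"{toy:,}")         # 168,192
print(f"{gpt2_small:,}")  # 124,439,808
```

Under these assumptions the smallest GPT-2 comes out at about 124M parameters. The same two functions cover both models, and only the width, the FFN size, the vocabulary, and the layer count L change.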
Checkpoint
Can you point to each labeled box in the full stack diagram and say what it does in one sentence?
What's next
Lesson 4 — Encoder vs decoder — how BERT (encoder) and GPT (decoder) differ from this stack.