Transformer architecture

Before we begin

A transformer stacks identical blocks. Each block refines token representations using attention and a small feed-forward network.

Focus on data flow, not on deriving every equation.

Figure: Full transformer stack

Token IDs → embeddings + position → L identical blocks → logits → next token. GPT-style decoders follow this path.

What you will learn

Trace data from token IDs to vocab logits.
Name every part of a transformer block.
Explain residual connections and layer normalization at a high level.
Describe positional encoding and why it is needed.

Before this lesson

Lesson 2 — Self-attention

Full stack (walk the diagram)

Read the diagram top to bottom:

Step	Component	What it does
1	Token IDs	Integers from the tokenizer — one per word/subword
2	Token embedding	Lookup table: ID → dense vector
3	+ Positional encoding	Adds an order signal; attention alone is permutation-equivariant, so it cannot see token order
4	Transformer block × L	Same sublayers repeated — depth builds richer context
5	Final layer norm	Stabilizes activations before the head (common in GPT-style models)
6	Linear head	Projects the last position (during generation) or all positions (during training) to vocabulary-size logits
7	Softmax / sample	Pick next token — greedy or temperature sampling (Module 7)

The dashed box in the diagram means “repeat this block L times” — e.g. L = 6 in a toy model, L = 80+ in large LLMs. Width (hidden size) and L together drive parameter count.
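The seven steps in the table can be traced as array shapes. The following is a minimal sketch, not a real model: the sizes are hypothetical, all weights are random and untrained, and `toy_block` is a stand-in that only preserves the shape a real attention + FFN block would preserve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, d_model, n_layers, seq_len = 50, 16, 2, 5

def layer_norm(x, eps=1e-5):
    # Normalize each token vector across its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_block(x, w):
    # Stand-in for a real block (attention + FFN): a residual update
    # that keeps the (seq_len, d_model) shape.
    return x + np.tanh(x @ w)

# Step 1: token IDs from a tokenizer (random here).
token_ids = rng.integers(0, vocab_size, size=seq_len)

# Steps 2-3: embedding lookup plus a learned-style position table.
tok_emb = rng.normal(size=(vocab_size, d_model)) * 0.02
pos_emb = rng.normal(size=(seq_len, d_model)) * 0.02
x = tok_emb[token_ids] + pos_emb            # shape (seq_len, d_model)

# Step 4: the same kind of block repeated L = n_layers times.
block_weights = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(n_layers)]
for w in block_weights:
    x = toy_block(x, w)

# Step 5: final layer norm.
x = layer_norm(x)

# Step 6: linear head applied to the last position only.
w_head = rng.normal(size=(d_model, vocab_size)) * 0.02
logits = x[-1] @ w_head                     # shape (vocab_size,)

# Step 7: softmax, then a greedy pick.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
next_token = int(np.argmax(probs))
```

Because the weights are random, `next_token` is meaningless; the point is that the hidden shape `(seq_len, d_model)` stays fixed through every block and only the head changes it to `vocab_size`.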

Inside one block (zoom in)

Figure: One transformer layer

Attention mixes tokens across the sequence; FFN processes each position; residuals + norm after each sublayer.

Each block has two sublayers:

Multi-head self-attention — every token gathers context from other tokens (Lesson 2).
Position-wise feed-forward (MLP) — same two-layer network applied independently at each position.

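After each sublayer comes a residual add followed by layer norm. This is the Post-LN order from the original paper; many modern models use Pre-LN, which applies layer norm to the sublayer input and leaves the residual path unnormalized (same parts, different order).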

The yellow dashed arcs in the full diagram are residual (skip) connections:

$\text{output} = \text{LayerNorm}(\text{sublayer}(x) + x)$

The block keeps the old representation and adds a correction. The skip path also gives gradients a direct route back to earlier layers, which is critical for training deep stacks.
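The two orderings of residual and layer norm can be compared side by side. This is a minimal sketch: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the layer norm omits the learned scale and shift for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector across its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Original ordering: add the residual, then normalize.
    return layer_norm(sublayer(x) + x)

def pre_ln(x, sublayer):
    # Common modern ordering: normalize the input, then add the residual.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 tokens, hidden size 8
w = rng.normal(size=(8, 8)) * 0.1

def toy_sublayer(h):
    # Hypothetical stand-in for attention or the FFN.
    return np.tanh(h @ w)

y_post = post_ln(x, toy_sublayer)
y_pre = pre_ln(x, toy_sublayer)

# If the sublayer contributes nothing, Pre-LN returns x unchanged:
# the skip path carries the old representation straight through.
identity_check = pre_ln(x, lambda h: np.zeros_like(h))
```

In the Post-LN form the output of every sublayer is renormalized, so each row of `y_post` has roughly zero mean; in the Pre-LN form the residual stream itself is never normalized inside the block.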

Feed-forward sublayer

Applied independently at each token position — same MLP weights everywhere, but different inputs per position.

Typical pattern:

Expand: hidden 512 → 2048
Activation (e.g. GELU)
Project back: 2048 → 512

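Attention moves information between positions; the FFN then applies a nonlinear transformation to each mixed vector on its own. Neither can do the other's job, so both are needed.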
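The expand, activate, project pattern can be sketched in a few lines. The weights below are random placeholders, and the GELU uses the common tanh approximation; both are assumptions for illustration only.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    # Expand to d_ff, apply the activation, project back to d_model.
    return gelu(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 6

# Random placeholder weights (a trained model would learn these).
w1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
out = feed_forward(x, w1, b1, w2, b2)       # shape (seq_len, d_model)

# Position-wise: feeding one token alone gives the same row as
# feeding the whole sequence, because no position sees another.
single = feed_forward(x[2:3], w1, b1, w2, b2)
```

The final comparison is what "applied independently at each token position" means in practice: row 2 of `out` matches `single[0]` up to floating-point rounding.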

Layer normalization

Layer norm stabilizes activations by normalizing each token's vector across its feature dimension to zero mean and unit variance, then applying a learned scale and shift. Without it, deep transformers are hard to train (activations explode or vanish).

You will see LayerNorm boxes labeled “Add & norm” in the block diagram.
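As a rough sketch of the computation inside a LayerNorm box, assuming the usual learned scale `gamma` and shift `beta` (set here to their typical initial values of ones and zeros):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per token, across the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)

# 4 tokens, 8 features, deliberately off-center and large in scale.
x = rng.normal(loc=3.0, scale=10.0, size=(4, 8))
gamma, beta = np.ones(8), np.zeros(8)

y = layer_norm(x, gamma, beta)
row_means = y.mean(axis=-1)   # close to 0 for every token
row_stds = y.std(axis=-1)     # close to 1 for every token
```

Each token is normalized using only its own features, so the result does not depend on the other tokens in the sequence or on the batch.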

Positional encoding

Self-attention alone treats tokens as a bag: reordering the input only reorders the output in the same way, because the pairwise scores between tokens do not depend on where they sit.

Fix: add a position signal (sinusoidal or learned) to each embedding so the model knows word order.
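A small experiment makes the problem and the fix concrete. This is a sketch with a single attention head, no mask, and random weights; the sinusoidal table follows the sine/cosine scheme from the original transformer paper, and all sizes and the permutation are arbitrary choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head, unmasked scaled dot-product attention.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sinusoidal_positions(seq_len, d_model):
    # Even feature indices get sine, odd indices get cosine.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
wq, wk, wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
perm = np.array([4, 2, 0, 3, 1])     # an arbitrary reordering of the tokens

# Without positions: shuffling the input just shuffles the output.
out = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)
order_blind = np.allclose(out_shuffled, out[perm])

# With positions: the same tokens in a new order give different outputs.
pe = sinusoidal_positions(seq_len, d_model)
out_pos = self_attention(x + pe, wq, wk, wv)
out_pos_shuffled = self_attention(x[perm] + pe, wq, wk, wv)
order_aware = not np.allclose(out_pos_shuffled, out_pos[perm])
```

`order_blind` shows that attention by itself cannot distinguish the two orderings; once the position table is added, the same token gets a different vector at a different position, so the outputs diverge.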

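Modern LLMs use variants such as RoPE (rotary position embedding), which rotates query and key vectors by a position-dependent angle inside attention instead of adding a vector to the embedding. The purpose is the same; attention scores then depend on relative position, which can make it easier to extend to sequences longer than those seen in training.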

Parameter scale intuition

Model size	Rough idea
Module 6 project	Thousands–millions of params — trains on a laptop
BERT-base	~110M
GPT-2	~124M – 1.5B (the smallest model was first reported as 117M)
GPT-3 and later	Billions to hundreds of billions (GPT-3: 175B)

Larger models use the same block diagram with wider embeddings, a larger FFN, and more layers L, so what you learn on one small stack carries over to ChatGPT-scale systems.
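To see how width and depth drive the totals in the table, here is a rough parameter counter for a GPT-style decoder. It is a sketch under stated assumptions: learned position embeddings, biases on every linear layer, two layer norms per block plus a final one, and an output head that shares weights with the token embedding. The toy configuration is hypothetical; the second configuration uses the commonly published GPT-2 small settings, which are not taken from this lesson.

```python
def count_params(vocab_size, d_model, n_ctx, n_layers, d_ff):
    # Token embedding table plus learned position table.
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Q, K, V and output projections, each d_model x d_model with a bias.
    attention = 4 * d_model * d_model + 4 * d_model
    # Expand to d_ff and project back, with biases.
    ffn = 2 * d_model * d_ff + d_ff + d_model
    # Two layer norms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    per_block = attention + ffn + norms
    # Final layer norm; the output head is assumed tied to the token embedding.
    final_norm = 2 * d_model
    return embeddings + n_layers * per_block + final_norm

# Hypothetical toy model, small enough for a laptop.
toy = count_params(vocab_size=1000, d_model=64, n_ctx=128, n_layers=2, d_ff=256)

# GPT-2 small configuration (assumed from its published settings).
gpt2_small = count_params(vocab_size=50257, d_model=768, n_ctx=1024, n_layers=12, d_ff=3072)

print(f"toy: {toy:,}  GPT-2 small: {gpt2_small:,}")
```

Under these assumptions the toy model has roughly 172 thousand parameters and the GPT-2 small configuration about 124 million. Per block, the count grows with the square of the hidden size and linearly with L, which is why wider and deeper models scale up so quickly.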

Checkpoint

Can you point to each labeled box in the full stack diagram and say what it does in one sentence?

What's next

Lesson 4 — Encoder vs decoder — how BERT (encoder) and GPT (decoder) differ from this stack.