
Module 6 — Transformers (core of GenAI)

Transformer architecture

Encoder blocks, feed-forward sublayers, residuals, layer norm, positional encoding — without heavy matrix calculus.

~75 min read + exercises

Before we begin

A transformer stacks identical blocks. Each block refines token representations using attention and a small feed-forward network.

Focus on data flow, not on deriving every equation.

Figure

Full transformer stack

[Diagram: full transformer (decoder / GPT-style). Data flows top to bottom: Token IDs, Token embedding (ID → vector), + Positional encoding (inject order), L repeated layers (Multi-head self-attention, Add & norm, Feed-forward (MLP), Add & norm, each with a residual skip that adds the input back), Final layer norm (optional), Linear head (vocab logits), Softmax / sample (next token).]
Token IDs → embeddings + position → L identical blocks → logits → next token. GPT-style decoders follow this path.

What you will learn

  • Trace data from token IDs to vocab logits.
  • Name every part of a transformer block.
  • Explain residual connections and layer normalization at a high level.
  • Describe positional encoding and why it is needed.

Before this lesson


Full stack (walk the diagram)

Read the diagram top to bottom:

| Step | Component | What it does |
| --- | --- | --- |
| 1 | Token IDs | Integers from the tokenizer, one per word/subword |
| 2 | Token embedding | Lookup table: ID → dense vector |
| 3 | + Positional encoding | Adds an order signal; attention alone ignores token order |
| 4 | Transformer block × L | Same sublayers repeated; depth builds richer context |
| 5 | Final layer norm | Stabilizes activations before the head (common in GPT-style models) |
| 6 | Linear head | Projects the last position (or all positions) to vocab-size logits |
| 7 | Softmax / sample | Picks the next token by greedy or temperature sampling (Module 7) |

The dashed box in the diagram means “repeat this block L times” — e.g. L = 6 in a toy model, L = 80+ in large LLMs. Width (hidden size) and L together drive parameter count.
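
To make the seven steps concrete, here is a minimal sketch in plain Python that traces one forward pass. It is an illustration under loud assumptions, not a real model: the sizes are toy values, the weights are random and untrained, positions use a learned-style lookup table, and `placeholder_block` is a hypothetical stand-in for the real attention + FFN block (it only applies a per-position update and adds the input back).

```python
import math
import random

random.seed(0)
VOCAB_SIZE, D_MODEL, NUM_LAYERS, MAX_LEN = 8, 4, 2, 16  # toy sizes (assumptions)

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), using plain lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

embedding_table = random_matrix(VOCAB_SIZE, D_MODEL)   # step 2: ID -> vector
position_table = random_matrix(MAX_LEN, D_MODEL)       # step 3: one vector per position
block_weights = [random_matrix(D_MODEL, D_MODEL) for _ in range(NUM_LAYERS)]
head_weights = random_matrix(D_MODEL, VOCAB_SIZE)      # step 6: hidden -> vocab logits

def placeholder_block(x, w):
    # Stand-in for attention + FFN: compute an update, then add the input back.
    update = matmul(x, w)
    return [[xi + ui for xi, ui in zip(x_row, u_row)] for x_row, u_row in zip(x, update)]

def next_token_probs(token_ids):
    x = [embedding_table[t] for t in token_ids]                  # steps 1-2
    x = [[e + p for e, p in zip(row, position_table[pos])]       # step 3
         for pos, row in enumerate(x)]
    for w in block_weights:                                      # step 4: L blocks
        x = placeholder_block(x, w)
    x = [layer_norm(row) for row in x]                           # step 5
    logits = matmul([x[-1]], head_weights)[0]                    # step 6: last position
    return softmax(logits)                                       # step 7

probs = next_token_probs([3, 1, 4])
next_token = max(range(VOCAB_SIZE), key=lambda i: probs[i])      # greedy pick
print(next_token, len(probs))
```

The shape never changes inside the loop: every block takes a list of `D_MODEL`-sized vectors and returns one of the same shape, which is what lets the same block be stacked L times.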


Inside one block (zoom in)

Figure

One transformer layer

[Diagram: one transformer layer. Multi-head self-attention, Add & norm, Feed-forward (MLP), Add & norm.]
Attention mixes tokens across the sequence; FFN processes each position; residuals + norm after each sublayer.

Each block has two sublayers:

  1. Multi-head self-attention — every token gathers context from other tokens (Lesson 2).
  2. Position-wise feed-forward (MLP) — same two-layer network applied independently at each position.

After each sublayer: residual add then layer norm (Post-LN in the original paper; many modern models use Pre-LN — same parts, different order).

The yellow dashed arcs in the full diagram are residual (skip) connections:

$$\text{output} = \text{LayerNorm}(\text{sublayer}(x) + x)$$

The block keeps the old representation and adds a correction — critical for training deep stacks.
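
The two orderings mentioned above can be compared side by side. This is a minimal sketch for a single token vector: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the layer norm here omits the learned scale and shift for brevity.

```python
import math

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def post_ln(x, sublayer):
    # Original-paper ordering: output = LayerNorm(sublayer(x) + x)
    return layer_norm([s + xi for s, xi in zip(sublayer(x), x)])

def pre_ln(x, sublayer):
    # Pre-LN ordering: output = x + sublayer(LayerNorm(x))
    return [xi + s for xi, s in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(vec):
    # Hypothetical sublayer that produces a small correction.
    return [0.1 * v for v in vec]

def zero_sublayer(vec):
    # A sublayer that contributes nothing.
    return [0.0 for _ in vec]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln(x, toy_sublayer)
pre = pre_ln(x, toy_sublayer)
untouched = pre_ln(x, zero_sublayer)  # the skip path carries x through unchanged
print(post, pre, untouched)
```

In the Pre-LN form the input reaches the output without passing through a norm, so a sublayer that outputs zeros leaves the representation exactly as it was. That is the "keep the old representation and add a correction" idea in its plainest form.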


Feed-forward sublayer

Applied independently at each token position — same MLP weights everywhere, but different inputs per position.

Typical pattern:

  • Expand: hidden 512 → 2048
  • Activation (e.g. GELU)
  • Project back: 2048 → 512

Attention mixes tokens; FFN transforms each mixed vector — both are needed.
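
A minimal sketch of the expand, activate, project-back pattern, assuming toy sizes (4 → 16 → 4 instead of 512 → 2048 → 512) and random, untrained weights. The names `feed_forward` and `ffn_over_sequence` are illustrative, not taken from any library.

```python
import math
import random

random.seed(0)
D_MODEL, D_FF = 4, 16  # toy sizes; same 4x expansion as 512 -> 2048

def gelu(x):
    # Exact GELU: x times the standard normal CDF of x
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weights, bias):
    # weights holds one row per output feature
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

w1 = [[random.uniform(-0.5, 0.5) for _ in range(D_MODEL)] for _ in range(D_FF)]
b1 = [0.0] * D_FF
w2 = [[random.uniform(-0.5, 0.5) for _ in range(D_FF)] for _ in range(D_MODEL)]
b2 = [0.0] * D_MODEL

def feed_forward(vec):
    hidden = [gelu(h) for h in linear(vec, w1, b1)]  # expand, then activation
    return linear(hidden, w2, b2)                    # project back to D_MODEL

def ffn_over_sequence(seq):
    # Same weights at every position; positions never see each other here.
    return [feed_forward(vec) for vec in seq]

sequence = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, -0.4]]
out = ffn_over_sequence(sequence)
changed = ffn_over_sequence([sequence[0], [9.0, 9.0, 9.0, 9.0]])
print(out[0] == changed[0])  # position 0 is unaffected by a change at position 1
```

Replacing the second token changes only the second output, which is why the FFN cannot mix information across positions and attention has to do that job.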


Layer normalization

Layer norm stabilizes activations: for each token, it rescales that token's feature vector to zero mean and unit variance, then applies a learned scale and shift. Without it, deep transformers are hard to train because activations can explode or vanish.

You will see LayerNorm boxes labeled “Add & norm” in the block diagram.
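
The per-token normalization can be sketched in a few lines. This is an illustration with made-up inputs: `gamma` and `beta` are the learned scale and shift, set here to their usual starting values of 1 and 0.

```python
import math

def layer_norm(vec, gamma, beta, eps=1e-5):
    # Statistics are computed over the features of ONE token, not over the batch.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(vec, gamma, beta)]

# Two tokens whose activations differ in scale by a factor of 1000.
tokens = [[1000.0, 2000.0, 3000.0, 4000.0], [1.0, 2.0, 3.0, 4.0]]
gamma = [1.0] * 4
beta = [0.0] * 4

normed = [layer_norm(token, gamma, beta) for token in tokens]
print(normed)
```

Both tokens come out on nearly the same scale, so the next sublayer sees inputs in a predictable range no matter how large the raw activations had grown.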


Positional encoding

Self-attention alone treats tokens as a bag: swapping two tokens leaves every pairwise score unchanged, so the outputs are simply swapped too and no information about order is captured.
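
This order-blindness is easy to check numerically. The sketch below assumes a simplified single-head attention with identity projections (queries, keys, and values are all the raw input vectors), which is enough to show the effect.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # Simplification: Q = K = V = x, one head, no learned weights.
    d = len(x[0])
    out = []
    for query in x:
        weights = softmax([sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                           for key in x])
        out.append([sum(w * value[j] for w, value in zip(weights, x))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
swapped = [tokens[1], tokens[0], tokens[2]]  # swap the first two tokens

out = self_attention(tokens)
out_swapped = self_attention(swapped)
print(out[0], out_swapped[1])  # same vector, just at a different index
```

Each token receives the same output vector wherever it sits in the sequence, so without a position signal "cat sat" and "sat cat" are indistinguishable to the block.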

Fix: add a position signal (sinusoidal or learned) to each embedding so the model knows word order.

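Modern LLMs use variants like RoPE (rotary position embedding), which rotates query and key vectors by a position-dependent angle inside attention instead of adding a vector to the embedding. The purpose is the same, and it often extrapolates better to sequences longer than those seen in training.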
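
The sinusoidal option can be sketched as follows, using the formulation from the original transformer paper: even feature indices get a sine, odd indices get a cosine, and the wavelength grows with the feature index. The token embedding below is a made-up vector for illustration.

```python
import math

def sinusoidal_encoding(position, d_model):
    encoding = []
    for i in range(0, d_model, 2):  # i is the even feature index
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # feature i
        encoding.append(math.cos(angle))  # feature i + 1
    return encoding[:d_model]  # trim in case d_model is odd

D_MODEL = 8
table = [sinusoidal_encoding(pos, D_MODEL) for pos in range(4)]

# The same (made-up) token embedding at positions 0 and 1
embedding = [0.5] * D_MODEL
at_pos0 = [e + p for e, p in zip(embedding, table[0])]
at_pos1 = [e + p for e, p in zip(embedding, table[1])]
print(at_pos0 != at_pos1)  # identical tokens now differ by position
```

Because the encoding is a fixed function of the position, it needs no training and can be computed for any sequence length.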


Parameter scale intuition

| Model | Rough size |
| --- | --- |
| Module 6 project | Thousands to millions of params; trains on a laptop |
| BERT-base | ~110M |
| GPT-2 | 117M – 1.5B |
| GPT-3+ | Billions |

Same block diagram — wider embeddings, larger FFN, more layers L. Understanding one small stack maps directly to ChatGPT-scale systems.
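
A rough way to see how width and depth drive the count is to add up the large weight matrices. The sketch below is a back-of-the-envelope estimate under stated assumptions: it ignores biases, layer norm parameters, and position tables, assumes the output head shares weights with the token embedding, and uses a hypothetical vocabulary of 32,000 tokens.

```python
def block_params(d_model, d_ff):
    attention = 4 * d_model * d_model  # query, key, value, and output projections
    feed_forward = 2 * d_model * d_ff  # expand, then project back
    return attention + feed_forward

def stack_params(vocab_size, d_model, d_ff, num_layers):
    embedding = vocab_size * d_model   # assumed to be shared with the output head
    return embedding + num_layers * block_params(d_model, d_ff)

# The 512 -> 2048 FFN from this lesson, with 6 layers
small = stack_params(vocab_size=32000, d_model=512, d_ff=2048, num_layers=6)

# Doubling the width while keeping the depth
wide = stack_params(vocab_size=32000, d_model=1024, d_ff=4096, num_layers=6)

print(f"{small:,} vs {wide:,}")
```

Per-block parameters grow with the square of the width but only linearly with the number of layers, so doubling the hidden size quadruples each block while doubling L merely doubles the stack.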


Checkpoint

Can you point to each labeled box in the full stack diagram and say what it does in one sentence?


What's next

Lesson 4 — Encoder vs decoder — how BERT (encoder) and GPT (decoder) differ from this stack.