Encoder vs decoder — BERT, GPT, translation

Before we begin

Not every transformer uses the same shape. Two families dominate modern NLP — plus a third for translation-style tasks:

Encoder — read and understand (full context).
Decoder — generate next token (causal context).
Encoder–decoder — read source, write target (translation).

Figure

Three transformer families

Same building blocks (attention + FFN) — different attention masks and training objectives.

What you will learn

Contrast encoder-only, decoder-only, and encoder–decoder models.
Map BERT and GPT to these stacks.
Know when cross-attention appears and how causal masking differs from bidirectional attention.

Before this lesson

Lesson 3 — Transformer architecture

Encoder-only (BERT-style)

Bidirectional self-attention: every token attends to every other token in the input, both to its left and to its right.
Great for classification, question answering, embeddings.
Not used directly for open-ended generation: the model relies on context from both sides, so predicting the next token would mean “seeing the future”.

Figure

BERT-style encoder stack

Full sentence in → encoder blocks with bidirectional attention → task head or embedding vector.

The attention mask is all-on: every position can attend to every other position. That is why BERT can use a [CLS] token or pooled output to represent the whole sentence.
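To make "[CLS] token or pooled output" concrete, here is a minimal sketch of the two usual ways to turn per-token encoder outputs into one sentence vector. The token list and the numbers in `hidden` are made up for illustration; a real encoder would produce much wider vectors.

```python
# Toy encoder output: one vector per input token (made-up numbers).
tokens = ["[CLS]", "great", "movie", "[SEP]"]
hidden = [
    [0.2, 0.8],  # [CLS]
    [0.9, 0.1],  # great
    [0.4, 0.6],  # movie
    [0.1, 0.3],  # [SEP]
]

# Option 1: use the vector at the [CLS] position. Because attention is
# bidirectional, this position has already mixed in every other token.
cls_vector = hidden[tokens.index("[CLS]")]

# Option 2: mean pooling, the average of all token vectors per dimension.
mean_vector = [sum(column) / len(hidden) for column in zip(*hidden)]

print("CLS vector: ", cls_vector)
print("Mean vector:", mean_vector)
```

Either vector can then feed a task head (for a label) or be stored as an embedding (for search or similarity).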

Use when: you have full text and want a label or vector — search, sentiment, semantic similarity.

Decoder-only (GPT-style)

Causal self-attention only — predict token t from tokens < t.
Trained as next-token prediction on large text corpora.
Generation: sample one token, append, repeat.
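The sample, append, repeat loop can be sketched in a few lines. The "model" here is a hypothetical lookup table (`NEXT_TOKEN_PROBS`) that only looks at the last token; a real decoder conditions on the whole prefix, but the loop around it has the same shape.

```python
import random

# Hypothetical stand-in for a trained decoder: next-token probabilities
# keyed by the last token only. All tokens and numbers are made up.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def toy_model(prefix):
    """Return a probability distribution over the next token."""
    return NEXT_TOKEN_PROBS[prefix[-1]]

def generate(prompt, max_new_tokens=10, seed=0):
    """Autoregressive generation: sample one token, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        next_token = rng.choices(candidates, weights=weights)[0]
        if next_token == "<eos>":  # stop when the model ends the sequence
            break
        tokens.append(next_token)
    return tokens

result = generate(["the"])
print(" ".join(result))
```

Each pass through the loop feeds the longer prefix back in, which is why generation cost grows with output length.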

Figure

GPT-style decoder stack

Causal mask hides future tokens. Last-position logits → sample next token → autoregressive loop.

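The causal mask blocks future positions (the upper triangle in the diagram). In practice the blocked attention scores are typically set to negative infinity before the softmax, so their attention weights come out as exactly zero. Position 3 cannot peek at token 4 during training, since that would be cheating on the next-token objective.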
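A small sketch can show the difference between the two masks. It assumes one common way to apply a mask: set blocked scores to negative infinity, then take a softmax over each row. The scores are all equal toy values so that only the mask shapes the result; the helper names are illustrative, not from any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def attention_weights(scores, mask):
    """Set masked-out scores to -inf, then softmax each row."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s if keep else float("-inf")
                  for s, keep in zip(score_row, mask_row)]
        weights.append(softmax(masked))
    return weights

n = 4
scores = [[0.0] * n for _ in range(n)]  # toy scores, all equal

bidirectional = attention_weights(scores, [[1] * n for _ in range(n)])
causal = attention_weights(scores, causal_mask(n))

print("Bidirectional (BERT-style):")
for row in bidirectional:
    print([round(w, 2) for w in row])
print("Causal (GPT-style):")
for row in causal:
    print([round(w, 2) for w in row])
```

With the all-on mask every row spreads its weight over all four positions. With the causal mask, row i spreads its weight over positions 0 to i only, and everything above the diagonal is zero.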

Use when: chat, completion, code generation — most GenAI products you see today.

Encoder–decoder (original translation transformer)

Encoder reads the full source sentence (e.g. French).
Decoder writes the target (e.g. English) one token at a time.
Cross-attention in the decoder: at each target (English) step, the decoder supplies the Query and the encoder outputs supply the Keys and Values.

Figure

Full encoder–decoder architecture

Encoder encodes source once; decoder self-attends causally and cross-attends to encoder context.

At each decoder step, every decoder block runs three sublayers in order:

Masked self-attention — decoder tokens see only past target tokens.
Cross-attention — decoder Query asks “which source words matter for this English word?”
FFN — process the blended representation.
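The cross-attention step can be sketched as follows. This is a simplified single-head version that skips the learned Query/Key/Value projections (an assumption made to keep it short), and all vectors are made-up toy values. The point is where each input comes from: Queries from the decoder, Keys and Values from the encoder.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    """Single-head cross-attention without learned projections.

    Each decoder state acts as a Query; the encoder outputs act as both
    Keys and Values. No causal mask: the whole source is visible.
    """
    d = len(encoder_outputs[0])
    all_weights, blended = [], []
    for query in decoder_states:
        # Scaled dot product between this query and every source position.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in encoder_outputs]
        weights = softmax(scores)
        # Weighted sum of the source vectors.
        mixed = [sum(w * value[j] for w, value in zip(weights, encoder_outputs))
                 for j in range(d)]
        all_weights.append(weights)
        blended.append(mixed)
    return all_weights, blended

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 source tokens
decoder_states = [[2.0, 0.0], [0.0, 2.0]]                  # 2 target positions

weights, blended = cross_attention(decoder_states, encoder_outputs)
for step, row in enumerate(weights):
    print(f"target step {step} attends to source with weights",
          [round(w, 2) for w in row])
```

The weight matrix has one row per target position and one column per source token, so it does not need to be square. The encoder outputs are computed once and reused at every decoder step.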

Still used in translation and some summarization systems (e.g. T5). Many new products use a large decoder-only model with the source pasted into the prompt instead.

Quick map

Model family	Stack	Attention	Typical task
BERT	Encoder	Bidirectional	Sentiment, search, embeddings
GPT	Decoder	Causal	Chat, writing, code
T5 / early MT	Enc–Dec	Bidirectional + causal + cross	Translate, summarize

Module 6 project choice

Your mini transformer project is decoder-only — next-word prediction on blog sentences, same training objective as GPT at toy scale.
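As a rough sketch of what next-word prediction means for the training data, the targets are the inputs shifted by one position. The sentence and the whitespace tokenization below are assumptions for illustration only.

```python
# Whitespace tokenization is a simplification used only for this sketch.
sentence = "transformers predict the next word"
tokens = sentence.split()

# The model reads tokens[:-1] and must predict tokens[1:], position by position.
inputs = tokens[:-1]
targets = tokens[1:]

# Equivalent view: every prefix is paired with the token that follows it.
pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

for context, target in pairs:
    print(f"{' '.join(context):40s} -> {target}")
```

With a causal mask, all of these prefix/target pairs are trained in a single forward pass over the sentence, because each position only sees the tokens before it.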

Checkpoint

GPT is mainly a ___ model; BERT is mainly an ___ model.

Answer sketch

Decoder (autoregressive generation); encoder (bidirectional understanding).

What's next

Lesson 5 — Tokenization