Encoder vs decoder — BERT, GPT, translation
Before we begin
Not every transformer uses the same shape. Two families dominate modern NLP — plus a third for translation-style tasks:
Encoder — read and understand (full context).
Decoder — generate next token (causal context).
Encoder–decoder — read source, write target (translation).
Figure
Three transformer families
What you will learn
- Contrast encoder-only, decoder-only, and encoder–decoder models.
- Map BERT and GPT to these stacks.
- Know when cross-attention appears and how causal masking differs from bidirectional attention.
Encoder-only (BERT-style)
- Bidirectional self-attention: every token attends to all other tokens, on both its left and its right.
- Great for classification, question answering, embeddings.
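- Not used directly for open-ended generation: with bidirectional attention, a position would “see the future” tokens it is supposed to predict, and BERT is trained to fill in masked tokens rather than to continue text left to right.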
Figure
BERT-style encoder stack
The attention mask is all-on: every position can attend to every other position. That is why BERT can use a [CLS] token or pooled output to represent the whole sentence.
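As a rough illustration of the all-on mask, the sketch below builds a bidirectional mask, turns toy scores into attention weights, and averages token vectors into a single sentence vector. The function names (`bidirectional_mask`, `attention_weights`, `mean_pool`) and the numbers are assumptions made up for this example, not BERT's actual code, and mean pooling is only one of several possible pooling choices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bidirectional_mask(n):
    """All-on mask: mask[i][j] is True for every pair of positions."""
    return [[True] * n for _ in range(n)]

def attention_weights(scores, mask):
    """Row-wise softmax; masked-out positions get -inf and so weight 0."""
    rows = []
    for row_scores, row_mask in zip(scores, mask):
        masked = [s if keep else float("-inf")
                  for s, keep in zip(row_scores, row_mask)]
        rows.append(softmax(masked))
    return rows

def mean_pool(token_vectors):
    """Average token vectors into one sentence vector (hypothetical pooling)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Toy input: 4 tokens with equal raw scores (made-up numbers).
toy_scores = [[0.0] * 4 for _ in range(4)]
toy_weights = attention_weights(toy_scores, bidirectional_mask(4))
print(toy_weights[0])  # weight is spread over all 4 positions, left and right
```

Because no position is masked, the first token's row already includes weight on the last token, which is what lets a single vector summarize the whole sentence.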
Use when: you have full text and want a label or vector — search, sentiment, semantic similarity.
Decoder-only (GPT-style)
- Causal self-attention only — predict token t from tokens < t.
- Trained as next-token prediction on large text corpora.
- Generation: sample one token, append, repeat.
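The sample, append, repeat loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_TABLE` is a made-up lookup keyed by the last token only, whereas a real decoder conditions on every earlier token, and `generate`, the `<eos>` marker, and the seed are illustrative choices.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token.
TOY_TABLE = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}

def next_token_probs(tokens):
    """Stand-in for a decoder forward pass; unknown tokens end the text."""
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt, max_new_tokens=10, seed=0):
    """Autoregressive loop: sample one token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        candidates = list(probs.keys())
        weights = list(probs.values())
        token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(token)
        if token == "<eos>":  # stop once the end marker is produced
            break
    return tokens

print(generate(["the"]))
```

Each new token is fed back in as part of the input for the next step, which is why generation cost grows with the length of the output.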
Figure
GPT-style decoder stack
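The causal mask blocks every future position (the upper triangle in the diagram): those attention scores are typically set to negative infinity before the softmax, so their weights come out as zero. Position 3 cannot peek at token 4 during training, because that would be cheating on the next-token objective.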
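A minimal sketch of a causal mask, assuming 0-indexed positions and equal toy scores so the effect of the mask is easy to read. The names `causal_mask` and `masked_attention_weights` are invented for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """mask[i][j] is True only when j <= i, so position i never sees the future."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_attention_weights(scores, mask):
    """Set blocked scores to -inf, then softmax each row."""
    rows = []
    for row_scores, row_mask in zip(scores, mask):
        masked = [s if keep else float("-inf")
                  for s, keep in zip(row_scores, row_mask)]
        rows.append(softmax(masked))
    return rows

# Toy input: 3 positions with equal raw scores (made-up numbers).
toy_scores = [[0.0] * 3 for _ in range(3)]
toy_weights = masked_attention_weights(toy_scores, causal_mask(3))
for row in toy_weights:
    print(row)  # entries above the diagonal are 0.0
```

The first row puts all of its weight on position 0, the second splits it between positions 0 and 1, and only the last row can use the whole sequence.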
Use when: chat, completion, code generation — most GenAI products you see today.
Encoder–decoder (original translation transformer)
- Encoder reads the full source sentence (e.g. French).
- Decoder writes the target (e.g. English) one token at a time.
- Cross-attention in the decoder: each English step queries encoder outputs (Keys/Values from source).
Figure
Full encoder–decoder architecture
Each decoder layer applies three sublayers in order:
- Masked self-attention — decoder tokens see only past target tokens.
- Cross-attention — decoder Query asks “which source words matter for this English word?”
- FFN — process the blended representation.
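The three sublayers above can be sketched on plain lists. This is a heavily simplified illustration under stated assumptions: single-head attention, no learned projections, no residual connections or layer normalization, and a bare ReLU standing in for the FFN. The names `attend` and `decoder_layer` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values, mask=None):
    """Scaled dot-product attention; mask[i][j] False blocks key j for query i."""
    dim = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        if mask is not None:
            scores = [s if mask[i][j] else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([sum(w * value[c] for w, value in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

def decoder_layer(target, source_memory):
    """One simplified decoder layer over target vectors and encoder outputs."""
    n = len(target)
    causal = [[j <= i for j in range(n)] for i in range(n)]
    # 1. Masked self-attention: target positions see only past target positions.
    x = attend(target, target, target, mask=causal)
    # 2. Cross-attention: Queries from the decoder, Keys/Values from the encoder.
    x = attend(x, source_memory, source_memory)
    # 3. Stand-in for the FFN: an element-wise ReLU with no learned weights.
    return [[max(0.0, v) for v in row] for row in x]

# Made-up vectors: 3 source positions, 2 target positions, dimension 2.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[1.0, 0.0], [0.0, 1.0]]
print(decoder_layer(target, memory))
```

Note that the source and target lengths differ: cross-attention returns one vector per target position, each a weighted blend of the encoder outputs.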
Still used in translation and some summarization systems (e.g. T5). Many new products use a large decoder-only model with the source pasted into the prompt instead.
Quick map
| Model family | Stack | Attention | Typical task |
|---|---|---|---|
| BERT | Encoder | Bidirectional | Sentiment, search, embeddings |
| GPT | Decoder | Causal | Chat, writing, code |
| T5 / early MT | Enc–Dec | Bidirectional + causal + cross | Translate, summarize |
Module 6 project choice
Your mini transformer project is decoder-only — next-word prediction on blog sentences, same training objective as GPT at toy scale.
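To make the next-word objective concrete, here is a small sketch of how one sentence could be turned into (context, target) training pairs. The function name `next_token_pairs`, the example sentence, and the whitespace tokenizer are assumptions for illustration, not the project's actual code.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: predict tokens[t] from tokens[:t]."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

sentence = "transformers read blog sentences"
pairs = next_token_pairs(sentence.split())  # naive whitespace tokenizer
for context, target in pairs:
    print(context, "->", target)
```

In practice a causal mask lets the model learn from all of these pairs in a single forward pass over the sentence instead of one pass per pair.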
Checkpoint
GPT is mainly a ___ model; BERT is mainly an ___ model.
Answer sketch
Decoder (autoregressive generation); encoder (bidirectional understanding).