
Module 6 — Transformers (core of GenAI)

Encoder vs decoder — BERT, GPT, translation

Bidirectional encoders, autoregressive decoders, cross-attention in seq2seq, and which stack powers which apps.

~70 min read + exercises


Before we begin

Not every transformer uses the same shape. Two families dominate modern NLP — plus a third for translation-style tasks:

Encoder — read and understand (full context).
Decoder — generate next token (causal context).
Encoder–decoder — read source, write target (translation).

Figure

Three transformer families

Encoder-only (BERT): bidirectional attention, sees left + right. Output: labels / embeddings.
Decoder-only (GPT): causal attention, past tokens only. Output: next-token text.
Encoder–decoder (T5 / MT): self + cross attention, read source, write target. Output: translate / summarize.
Same building blocks (attention + FFN) — different attention masks and training objectives.

What you will learn

  • Contrast encoder-only, decoder-only, and encoder–decoder models.
  • Map BERT and GPT to these stacks.
  • Know when cross-attention appears and how causal masking differs from bidirectional attention.



Encoder-only (BERT-style)

  • Bidirectional self-attention — every token sees left and right neighbors.
  • Great for classification, question answering, embeddings.
  • Not used directly for open-ended generation: a next-token objective with bidirectional attention would let each position “see the future” it is supposed to predict.

Figure

BERT-style encoder stack

Encoder-only (BERT-style): read full sentence → understanding / classification / embeddings.
Flow: [CLS] The cat sat on the mat → embeddings + position → encoder blocks × L (bidirectional self-attention + FFN; every token sees left AND right) → task head → label or vector.
Attention mask: all visible.
Full sentence in → encoder blocks with bidirectional attention → task head or embedding vector.

The attention mask is all-on: every position can attend to every other position. That is why BERT can use a [CLS] token or pooled output to represent the whole sentence.
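To make the all-on mask concrete, here is a minimal sketch in plain Python. It assumes a single attention head with Q = K = V taken straight from made-up 2-dimensional embeddings; the learned projections, multiple heads, and the name `self_attention` are simplifications for illustration, not how BERT is actually implemented.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, mask):
    """Single-head attention with Q = K = V = x (learned projections omitted).

    mask[i][j] is True where position i may attend to position j.
    """
    d = len(x[0])
    weights = []
    for i in range(len(x)):
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            if mask[i][j] else float("-inf")
            for j in range(len(x))
        ]
        weights.append(softmax(scores))
    # Each output row is a weighted average of all visible input rows.
    out = [[sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
           for w in weights]
    return out, weights

tokens = ["[CLS]", "The", "cat", "sat"]
x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.3, 0.7]]  # made-up embeddings
mask = [[True] * len(tokens) for _ in tokens]          # all-on: nothing hidden

out, weights = self_attention(x, mask)
cls_vector = out[0]  # the [CLS] row has mixed in every token of the sentence
print(weights[0])
```

Every entry in the [CLS] row of `weights` is positive, so `cls_vector` blends information from the whole sentence. That is the property a task head relies on.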

Use when: you have full text and want a label or vector — search, sentiment, semantic similarity.


Decoder-only (GPT-style)

  • Causal self-attention only — predict token t from tokens < t.
  • Trained as next-token prediction on large text corpora.
  • Generation: sample one token, append, repeat.
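The sample, append, repeat loop from the last bullet can be sketched as follows. The `BIGRAM` table is a hypothetical stand-in for a trained model: it looks only at the last token, whereas a real decoder conditions on the whole prefix through causal attention. The `<eos>` token and the function names are likewise assumptions for this example.

```python
import random

# Hypothetical toy "model": next-token probabilities given the last token only.
BIGRAM = {
    "The": {"cat": 0.9, "mat": 0.1},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "the": {"mat": 1.0},
    "mat": {"<eos>": 1.0},
}

def next_token_probs(tokens):
    """Stand-in for a forward pass: returns {token: probability}."""
    return BIGRAM[tokens[-1]]

def generate(prompt, max_new_tokens=10, rng=None):
    rng = rng or random.Random(0)  # seeded so runs are repeatable
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        tok = rng.choices(choices, weights=weights, k=1)[0]  # sample one token
        if tok == "<eos>":
            break
        tokens.append(tok)  # append, then loop again on the longer prefix
    return tokens

print(" ".join(generate(["The", "cat", "sat"])))
```

Starting from "The cat sat", every remaining step in this toy table has a single possible continuation, so the loop produces "The cat sat on the mat" and stops at `<eos>`.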

Figure

GPT-style decoder stack

Decoder-only (GPT-style): predict next token left-to-right (autoregressive generation).
Flow: "The cat sat → ?" → embeddings + position → decoder blocks × L (causal self-attention + FFN; token t cannot see t+1, t+2, …) → linear → vocab logits → sample “on” → append → repeat.
Causal mask: future hidden.
Causal mask hides future tokens. Last position logits → sample next word → autoregressive loop.

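The causal mask hides future columns (upper triangle in the diagram): masked scores are set to −∞ before the softmax, so the corresponding attention weights come out as exactly zero. Position 3 cannot peek at token 4 during training; that would be cheating on the next-token objective.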
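A small sketch of the mask itself, in plain Python. It uses made-up equal scores so the effect of masking is easy to read off, and it shows one common way to apply the mask: set hidden scores to -inf before the softmax. Positions are 0-indexed here, and the helper names are illustrative.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax with hidden entries set to -inf first."""
    weights = []
    for row, visible in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, visible)]
        m = max(masked)  # the diagonal is always visible, so m is finite
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 4
scores = [[0.0] * n for _ in range(n)]  # made-up equal scores
weights = masked_softmax(scores, causal_mask(n))
for row in weights:
    print([round(w, 2) for w in row])
```

With equal scores, row i spreads its weight evenly over positions 0 through i and gives exactly zero to every later position, which is the lower-triangular pattern in the diagram.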

Use when: chat, completion, code generation — most GenAI products you see today.


Encoder–decoder (original translation transformer)

  • Encoder reads the full source sentence (e.g. French).
  • Decoder writes the target (e.g. English) one token at a time.
  • Cross-attention in the decoder: each English step queries encoder outputs (Keys/Values from source).

Figure

Full encoder–decoder architecture

Encoder–decoder (translation): source sentence encoded once; decoder generates target step by step.
Source (French): Le chat est assis. Target (English): The cat sat ↓
Encoder: embed + pos → encoder block × L → context vectors (bidirectional attention).
Decoder: embed + pos → masked self-attention → cross-attention + FFN (causal + cross attention) → next English token.
Cross-attention: Q from decoder; K, V from encoder.
Encoder encodes source once; decoder self-attends causally and cross-attends to encoder context.

At each decoder step:

  1. Masked self-attention — decoder tokens see only past target tokens.
  2. Cross-attention — decoder Query asks “which source words matter for this English word?”
  3. FFN — process the blended representation.
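The three steps above can be sketched in plain Python. This is a heavily simplified illustration, not the original architecture: it assumes one head, no learned projections, no residual connections or layer normalization, and a bare ReLU standing in for the two-layer FFN. The vectors and the names `attend` and `decoder_step` are made up for the example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values, causal=False):
    """Scaled dot-product attention; causal=True hides keys at j > i."""
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        scores = [
            float("-inf") if causal and j > i
            else sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for j, k in enumerate(keys)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * values[j][c] for j in range(len(values)))
                    for c in range(len(values[0]))])
    return out

def decoder_step(target, encoder_out):
    # 1. Masked self-attention: target tokens see only past target tokens.
    x = attend(target, target, target, causal=True)
    # 2. Cross-attention: Q from the decoder, K and V from the encoder.
    x = attend(x, encoder_out, encoder_out)
    # 3. Stand-in for the FFN (a real block uses two learned linear layers).
    return [[max(0.0, v) for v in row] for row in x]

encoder_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]  # "Le chat est assis"
target = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]                   # "The cat sat"
out = decoder_step(target, encoder_out)
print(len(out), len(out[0]))
```

The source has four positions and the target has three, which is fine: cross-attention returns one vector per decoder position, each a weighted mix of the four encoder vectors. Because of the causal mask in step 1, changing the last target vector leaves the outputs for the earlier target positions unchanged.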

Still used in translation and some summarization systems (e.g. T5). Many new products use a large decoder-only model with the source pasted into the prompt instead.


Quick map

Model family | Stack | Attention | Typical task
BERT | Encoder | Bidirectional | Sentiment, search, embeddings
GPT | Decoder | Causal | Chat, writing, code
T5 / early MT | Enc–Dec | Bidirectional + causal + cross | Translate, summarize

Module 6 project choice

Your mini transformer project is decoder-only — next-word prediction on blog sentences, same training objective as GPT at toy scale.
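As a rough sketch of what that objective looks like in code: the targets are the inputs shifted left by one, and the loss at each position is the cross-entropy between the model's logits and the true next token. The sentence, vocabulary construction, and function names below are assumptions for illustration; the project may organize this differently.

```python
import math

def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the same tokens shifted by one."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(logits, target_id):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

tokens = "the cat sat on the mat".split()
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}  # ids in first-seen order
ids = [vocab[t] for t in tokens]

inputs, targets = next_token_pairs(ids)
print(inputs)   # position t of the input ...
print(targets)  # ... is trained to predict position t + 1

# An untrained model with equal logits scores log(vocab size) per token.
uniform_logits = [0.0] * len(vocab)
loss = cross_entropy(uniform_logits, targets[0])
print(round(loss, 3))
```

A loss near log(vocabulary size) is a handy sanity check at the start of training: it means the model is guessing uniformly, and the number should fall as training proceeds.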


Checkpoint

GPT is mainly a ___ model; BERT is mainly an ___ model.

Answer sketch

Decoder (autoregressive generation); encoder (bidirectional understanding).


What's next

Lesson 5 — Tokenization