
Module 6 — Transformers (core of GenAI)

Tokenization & context window

Tokens vs words, BPE subwords, vocabulary, padding, truncation, and context length limits in production.

~65 min read + exercises


Before we begin

Models never see raw strings — they see token IDs. Tokenization splits text into pieces the vocabulary knows.

What is a token? A unit of text (word, subword, or byte chunk) mapped to an integer the model embeds.
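To make the text-to-ID mapping concrete, here is a minimal sketch of a word-level lookup. The vocabulary, its IDs, and the `encode` / `decode` helpers are all made up for illustration; real tokenizers use subwords and far larger vocabularies.

```python
# Toy vocabulary: every token and ID here is hypothetical.
vocab = {"<unk>": 0, "models": 1, "see": 2, "token": 3, "ids": 4}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text: str) -> list[int]:
    """Lowercase, split on whitespace, and look each piece up in the vocab."""
    return [vocab.get(piece, vocab["<unk>"]) for piece in text.lower().split()]

def decode(ids: list[int]) -> list[str]:
    """Map integer IDs back to their token strings."""
    return [id_to_token[i] for i in ids]

ids = encode("Models see token IDs")
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # ['models', 'see', 'token', 'ids']
```

Anything outside the vocabulary falls back to `<unk>`, which is the weakness subword tokenization addresses.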

Figure: Text → tokens → embeddings

Text (string) → Tokenizer (BPE) → Token IDs ([42, 7…]) → Embeddings (vectors)

The tokenizer runs before the transformer stack.

What you will learn

  • Define tokens, vocabulary, and IDs.
  • Explain subword tokenization (BPE).
  • Describe the context window and truncation.


Word-level vs subword

Word-level: one ID per dictionary word. The vocabulary is huge, and any word outside it becomes <unk> (unknown).

Subword (BPE, WordPiece, SentencePiece): frequent words stay whole; rare words split:

  • "transformers" → "transform" + "ers"
  • "unhappiness" → "un" + "happiness"

Smaller vocab, fewer unknowns — standard for LLMs.
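The core of BPE is short enough to sketch: start from single characters and repeatedly merge the most frequent adjacent pair. The code below is a simplified illustration, not a production tokenizer. The corpus counts and the function names (`learn_bpe`, `merge_pair`, `segment`) are assumptions for this example, and it skips details such as end-of-word markers and byte-level fallback.

```python
from collections import Counter

def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn merges by repeatedly fusing the most frequent adjacent pair."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = {merge_pair(symbols, best): count for symbols, count in words.items()}
    return merges

def segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into subwords by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical word frequencies, chosen so the first two merges are unambiguous.
merges = learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2)
print(merges)                   # [('u', 'g'), ('h', 'ug')]
print(segment("hugs", merges))  # ['hug', 's']
print(segment("bug", merges))   # ['b', 'ug']
```

Note that "bug" never appeared in the corpus, yet it still splits into known pieces instead of becoming an unknown token.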


Special tokens

Common examples:

  • <pad> — batch padding
  • <bos> / <eos> — start / end
  • <unk> — unknown (if used)

Chat models add tokens for roles (user, assistant) in templated prompts (Module 7).
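A small sketch of how special tokens show up in a batch: each sequence is wrapped in start and end markers, then right-padded to a common length. The IDs for `<pad>`, `<bos>`, and `<eos>` below are hypothetical (every tokenizer defines its own), and `pad_batch` is an illustrative helper, not a library function.

```python
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2  # hypothetical IDs for <pad>, <bos>, <eos>

def pad_batch(sequences: list[list[int]]) -> tuple[list[list[int]], list[list[int]]]:
    """Wrap each sequence in <bos>/<eos>, then right-pad to the longest one."""
    wrapped = [[BOS_ID, *seq, EOS_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in wrapped]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in wrapped]
    return input_ids, attention_mask

ids, mask = pad_batch([[10, 11, 12], [20]])
print(ids)   # [[1, 10, 11, 12, 2], [1, 20, 2, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```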


Context window

Context window = max tokens processed in one forward pass (e.g. 4k, 8k, 128k).

If your document is longer:

  • Truncate (keep head or tail)
  • Chunk with overlap
  • Summarize first
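The first two options can be sketched directly on a list of token IDs. The helper names `truncate` and `chunk_with_overlap` and their parameters are assumptions for this example; summarizing first needs a model call and is not shown.

```python
def truncate(ids: list[int], max_len: int, keep: str = "head") -> list[int]:
    """Keep the first (head) or last (tail) max_len tokens."""
    if keep == "head":
        return ids[:max_len]
    if keep == "tail":
        return ids[-max_len:] if max_len > 0 else []
    raise ValueError("keep must be 'head' or 'tail'")

def chunk_with_overlap(ids: list[int], chunk_size: int, overlap: int) -> list[list[int]]:
    """Split ids into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + chunk_size])
        if start + chunk_size >= len(ids):
            break  # this window already reaches the end of the document
    return chunks

doc = list(range(10))  # stand-in for a 10-token document
print(truncate(doc, 4))                # [0, 1, 2, 3]
print(truncate(doc, 4, keep="tail"))   # [6, 7, 8, 9]
print(chunk_with_overlap(doc, 4, 1))   # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Overlap repeats a few tokens at each boundary so that a sentence cut by one chunk still appears whole in the next.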

Attention cost grows with sequence length, quadratically for standard self-attention, so long context is expensive.
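A rough back-of-the-envelope check, assuming standard full self-attention where every token attends to every other token: the score matrix has one entry per token pair, so its size scales with the square of the sequence length. The function name is illustrative, and the count is per head and per layer; optimized or sparse attention variants change the picture.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one full attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len

base = attention_score_entries(4_096)
for n in (4_096, 8_192, 131_072):
    # Print the cost relative to a 4k-token window.
    print(n, attention_score_entries(n) // base)
# 4096 1
# 8192 4
# 131072 1024
```

Doubling the window quadruples the score matrix; going from 4k to 128k multiplies it by 1024.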


Training vs inference

  • Training: fixed max length; pad shorter sequences in a batch.
  • Inference: prompt length + generated tokens must fit in window.
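The inference constraint is simple arithmetic: whatever the prompt uses is no longer available for generation. A minimal sketch, where `max_new_tokens_allowed` and its parameters are hypothetical names for this example:

```python
def max_new_tokens_allowed(prompt_len: int, context_window: int, requested: int) -> int:
    """Clamp the requested generation length so prompt + output fits the window."""
    if prompt_len >= context_window:
        raise ValueError("prompt leaves no room in the context window")
    return min(requested, context_window - prompt_len)

# With a 256-token window, a 200-token prompt leaves room for 56 new tokens.
print(max_new_tokens_allowed(prompt_len=200, context_window=256, requested=100))  # 56
print(max_new_tokens_allowed(prompt_len=100, context_window=256, requested=100))  # 100
```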

Your mini transformer might use 128–256 tokens — enough for blog paragraphs on a laptop.


Checkpoint

What is the context window?

Answer sketch

The maximum number of tokens the model can handle in one pass — inputs plus generated output for decoders.


What's next

Lesson 6 — Vectorization