Tokenization and context window
Before we begin
Models never see raw strings — they see token IDs. Tokenization splits text into pieces the vocabulary knows.
What is a token? A unit of text (word, subword, or byte chunk) mapped to an integer the model embeds.
Figure
Text → tokens → embeddings
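The pipeline in the figure can be sketched in a few lines. This is a toy illustration only: it assumes whitespace splitting instead of a real subword tokenizer, and the vocabulary, the embedding size `dim`, and the random embedding table are all made up for the example.

```python
import random

text = "models see token ids"

# Step 1: text -> tokens (whitespace split is a simplifying assumption).
tokens = text.split()

# Step 2: tokens -> integer IDs via a vocabulary built from this one sentence.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]

# Step 3: IDs -> embeddings by row lookup in a table (random here, learned in a real model).
rng = random.Random(0)
dim = 4  # hypothetical embedding size
table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]
embeddings = [table[i] for i in ids]

print(tokens)
print(ids)
print(len(embeddings), "vectors of size", dim)
```

The model itself only ever works with `ids` and the vectors looked up from them, never with the original string.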
What you will learn
- Define tokens, vocabulary, and IDs.
- Explain subword tokenization (BPE).
- Describe the context window and truncation.
Word-level vs subword
Word-level: one ID per dictionary word. The vocabulary is huge, and any word outside it becomes <unk>.
Subword (BPE, WordPiece, SentencePiece): frequent words stay whole; rare words split:
- "transformers" → "transform" + "ers"
- "unhappiness" → "un" + "happiness"
Smaller vocab, fewer unknowns — standard for LLMs.
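To make the BPE idea concrete, here is a minimal sketch of how merge rules might be learned and applied. It is a simplification under stated assumptions: it works on characters of isolated words, ignores end-of-word markers and byte-level handling that production tokenizers use, and the function names (`learn_bpe`, `bpe_encode`, `merge_pair`) and the toy corpus are invented for this example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Merge the most frequent pair (ties go to the first pair seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def bpe_encode(word, merges):
    """Split a word into subwords by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus (made up): frequent fragments should end up as single symbols.
toy_corpus = ["happy", "happy", "happiness", "unhappy", "unhappiness", "unkind"]
merges = learn_bpe(toy_corpus, num_merges=8)
print(bpe_encode("unhappiness", merges))
```

Because encoding always falls back to single characters, a word never seen in training can still be represented. That is why subword vocabularies produce far fewer unknowns than word-level ones.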
Special tokens
Common examples:
- <pad>: batch padding
- <bos> / <eos>: start / end of sequence
- <unk>: unknown (if used)
Chat models add tokens for roles (user, assistant) in templated prompts (Module 7).
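As a rough sketch of how special tokens show up in practice, the snippet below reserves the first IDs for them and wraps each sequence in <bos> and <eos>. The ID assignments, the helper names `build_vocab` and `encode_with_specials`, and the sample words are assumptions for illustration; real tokenizers choose their own IDs and conventions.

```python
SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]

def build_vocab(tokens):
    """Give the special tokens the lowest IDs, then add regular tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode_with_specials(tokens, vocab):
    """Map tokens to IDs, using <unk> for misses, and add <bos>/<eos>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokens]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]

vocab = build_vocab(["the", "cat", "sat"])
print(encode_with_specials(["the", "dog", "sat"], vocab))  # → [1, 4, 3, 6, 2]
```

Here "dog" is not in the vocabulary, so it maps to the <unk> ID (3), while 1 and 2 mark the start and end of the sequence.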
Context window
Context window = max tokens processed in one forward pass (e.g. 4k, 8k, 128k).
If your document is longer:
- Truncate (keep head or tail)
- Chunk with overlap
- Summarize first
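The first two options can be sketched directly on a list of token IDs. The helper names `truncate` and `chunk_with_overlap` and their parameters are hypothetical, chosen for this example; summarizing first needs a model call and is not shown.

```python
def truncate(ids, max_len, keep="head"):
    """Keep the first (head) or last (tail) `max_len` token IDs."""
    if keep == "head":
        return ids[:max_len]
    if keep == "tail":
        # Guard max_len == 0, because ids[-0:] would return the whole list.
        return ids[-max_len:] if max_len > 0 else []
    raise ValueError("keep must be 'head' or 'tail'")

def chunk_with_overlap(ids, size, overlap):
    """Split IDs into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + size])
        if start + size >= len(ids):
            break  # the last window already reaches the end
    return chunks

doc = list(range(10))  # stand-in for 10 token IDs
print(truncate(doc, 3, keep="head"))   # → [0, 1, 2]
print(truncate(doc, 3, keep="tail"))   # → [7, 8, 9]
print(chunk_with_overlap(doc, size=4, overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the price of processing some tokens twice.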
With standard self-attention, every token attends to every other token, so cost grows quadratically with sequence length. Long context is expensive.
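A quick back-of-the-envelope calculation shows how fast this grows. It assumes standard full self-attention, where each of the n tokens is scored against all n tokens, and counts scores for a single head in a single layer; the function name `attention_scores` is made up for the example, and efficient attention variants change the picture.

```python
def attention_scores(n):
    """Number of query-key scores for one head in one layer (n x n)."""
    return n * n

for n in (4096, 8192, 131072):  # roughly 4k, 8k, 128k tokens
    print(f"{n:>7} tokens -> {attention_scores(n):,} scores")
```

Doubling the sequence length from 4k to 8k quadruples the number of scores, which is why window size is a real budget decision and not a free setting.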
Training vs inference
- Training: fixed max length; pad shorter sequences in a batch.
- Inference: prompt length + generated tokens must fit in window.
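Both points can be illustrated with a small sketch. The helpers `pad_batch` and `max_new_tokens` are hypothetical names, and the pad ID of 0 is an assumption; the attention mask marks which positions hold real tokens so padding can be ignored.

```python
def pad_batch(seqs, max_len, pad_id=0):
    """Cut or pad every sequence to `max_len`; also return a 1/0 mask."""
    batch, mask = [], []
    for seq in seqs:
        seq = seq[:max_len]              # truncate anything too long
        n_pad = max_len - len(seq)
        batch.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return batch, mask

def max_new_tokens(prompt_len, window):
    """Tokens left for generation once the prompt is in the window."""
    return max(0, window - prompt_len)

batch, mask = pad_batch([[5, 6, 7], [8]], max_len=4)
print(batch)  # → [[5, 6, 7, 0], [8, 0, 0, 0]]
print(mask)   # → [[1, 1, 1, 0], [1, 0, 0, 0]]
print(max_new_tokens(prompt_len=100, window=128))  # → 28
```

With a 128-token window, a 100-token prompt leaves room for at most 28 generated tokens; a longer prompt has to be truncated or chunked before generation can start.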
Your mini transformer might use 128–256 tokens — enough for blog paragraphs on a laptop.
Checkpoint
What is the context window?
Answer sketch
The maximum number of tokens the model can handle in one pass — inputs plus generated output for decoders.