Tokenization and context window

Before we begin

Models never see raw strings — they see token IDs. Tokenization splits text into pieces the vocabulary knows.

What is a token? A unit of text (word, subword, or byte chunk) mapped to an integer the model embeds.

Figure

Text → tokens → embeddings

Tokenizer runs before the transformer stack.

Word-level: one ID per dictionary word — huge vocab, many <unk> unknowns.

Subword (BPE, WordPiece, SentencePiece): frequent words stay whole; rare words split:

Smaller vocab, fewer unknowns — standard for LLMs.

Common examples:

Chat models add tokens for roles (user, assistant) in templated prompts (Module 7).

Context window = max tokens processed in one forward pass (e.g. 4k, 8k, 128k).

If your document is longer:

Attention cost grows with sequence length — long context is expensive.

Your mini transformer might use 128–256 tokens — enough for blog paragraphs on a laptop.

What is the context window?

Answer sketch

The maximum number of tokens the model can handle in one pass — inputs plus generated output for decoders.