Vectorization — text to vectors
Before we begin
Tokenization maps text to integer IDs. Vectorization maps those IDs to dense vectors the transformer can compute with.
Vectorization is the step where discrete tokens become continuous numbers — the bridge between human language and matrix math.
Figure: Text → tokens → vectors
What you will learn
- Explain token embeddings inside an LLM.
- Describe positional encoding and why order matters.
- Distinguish in-model embeddings from retrieval embeddings (RAG).
- Connect vectorization to attention (Module 6 Lessons 1–2).
Token embeddings
Each token ID i maps to a learned vector eᵢ of dimension d (often 768–4096 in production).
The model stores an embedding table — a matrix of shape [vocab_size, d]. Row i is the vector for token i.
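As a minimal sketch of the lookup described above, the table can be modeled as a NumPy array and indexed with token IDs. The sizes, the random values standing in for learned weights, and the token IDs are all made up for illustration.

```python
import numpy as np

# Toy sizes, chosen only for illustration; real models use far larger values.
vocab_size, d = 10, 4

rng = np.random.default_rng(0)
# Stand-in for a learned embedding table: one row of length d per token ID.
embedding_table = rng.normal(size=(vocab_size, d))

token_ids = np.array([1, 7, 3, 7])          # hypothetical tokenizer output
token_vectors = embedding_table[token_ids]  # row lookup via integer indexing

print(token_vectors.shape)  # (4, 4), i.e. [seq_len, d]
```

Token ID 7 appears twice, so rows 1 and 3 of the result are identical: at this stage a token's vector depends only on its ID, not on where it occurs.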
Why vectors, not one-hot?
- One-hot vectors are huge and sparse (vocab size 50k+).
- Dense embeddings encode semantic similarity: related tokens sit closer together in the vector space (same idea as word2vec in Module 4).
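One way to see how the two representations relate: multiplying a one-hot vector by the embedding table selects a single row, so a direct row lookup gives the same result without ever building the sparse vector. The sketch below uses toy sizes and random stand-in weights, both chosen for illustration.

```python
import numpy as np

vocab_size, d = 6, 3  # toy sizes for illustration
rng = np.random.default_rng(1)
embedding_table = rng.normal(size=(vocab_size, d))  # stand-in for learned weights

token_id = 4
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0  # sparse vector: a single 1 among vocab_size entries

via_matmul = one_hot @ embedding_table   # touches every row of the table
via_lookup = embedding_table[token_id]   # reads one row directly

same = np.allclose(via_matmul, via_lookup)
print(same)
```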
Positional information
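Without an extra signal, self-attention has no notion of order: reordering the input tokens only reorders the output rows (strictly, it is permutation-equivariant), so each token in "cat sat" and "sat cat" would get exactly the same vector.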
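The order blindness can be checked numerically. This is a sketch of single-head, unmasked self-attention with no positional signal. The projection matrices and the two "token" vectors are random stand-ins, and the helper name self_attention is hypothetical.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head, unmasked scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
d = 4
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

cat, sat = rng.normal(size=d), rng.normal(size=d)  # stand-in token vectors
out_cat_sat = self_attention(np.stack([cat, sat]), w_q, w_k, w_v)
out_sat_cat = self_attention(np.stack([sat, cat]), w_q, w_k, w_v)

# Swapping the inputs only swaps the output rows; each token's vector is unchanged.
order_is_invisible = np.allclose(out_cat_sat, out_sat_cat[::-1])
print(order_is_invisible)
```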
Transformers add positional encodings (sinusoidal or learned) to each token vector so the model knows order.
| Approach | Idea |
|---|---|
| Sinusoidal | Fixed math functions of position (original paper) |
| Learned | Trainable position vectors (common in GPT-style models) |
| RoPE | Rotate query/key vectors by position (Llama, many modern LLMs) |
You do not need to implement RoPE — know that position is explicit, not inferred from token IDs alone.
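For the sinusoidal row of the table, a minimal sketch follows. It uses the formulation from the original transformer paper, with sine on even dimensions and cosine on odd ones. The function name sinusoidal_positions is hypothetical, and the sketch assumes d is even.

```python
import numpy as np

def sinusoidal_positions(seq_len, d):
    """Return a [seq_len, d] matrix of fixed sinusoidal encodings (assumes even d)."""
    positions = np.arange(seq_len)[:, None]   # [seq_len, 1]
    even_dims = np.arange(0, d, 2)[None, :]   # [1, d/2], the values 2i
    angles = positions / np.power(10000.0, even_dims / d)
    encoding = np.zeros((seq_len, d))
    encoding[:, 0::2] = np.sin(angles)        # even dimensions
    encoding[:, 1::2] = np.cos(angles)        # odd dimensions
    return encoding

pe = sinusoidal_positions(seq_len=4, d=8)
print(pe.shape)  # (4, 8)
```

Each row is a distinct pattern that depends only on the position index, so adding row p to the token vector at position p marks where that token sits in the sequence.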
From tokens to attention input
Full pipeline for one forward pass:
1. Tokenize → [101, 4523, 892, 102]
2. Embed → matrix [seq_len, d]
3. Add position → same shape, position-aware
4. Self-attention → each row becomes a contextual vector blending other positions
Vectorization is steps 2–3; attention (Lessons 1–2) is step 4.
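The vectorization steps (embed, then add position) can be sketched as follows, using the learned-position variant. Both tables are random stand-ins for trained weights. The values of vocab_size, max_len, and d are assumptions, with vocab_size picked only to be large enough to cover the example IDs.

```python
import numpy as np

vocab_size, max_len, d = 5000, 16, 8  # assumed toy sizes
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, d))    # stand-in for learned token embeddings
position_table = rng.normal(size=(max_len, d))    # stand-in for learned position vectors

token_ids = np.array([101, 4523, 892, 102])       # tokenizer output from the pipeline above
x = token_table[token_ids]                        # embed: [seq_len, d]
x = x + position_table[: len(token_ids)]          # add position: same shape

print(x.shape)  # (4, 8)
```

The matrix x is what the attention layers receive as input.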
Sentence embeddings (retrieval)
For RAG (Module 7), a separate model often maps whole sentences to one vector for search — not the same table as inside the LLM.
| | In-LLM token embedding | Retrieval / sentence embedding |
|---|---|---|
| Granularity | Per token | Per chunk or sentence |
| Model | Part of GPT weights | e.g. text-embedding-3-small |
| Use | Generation | Find similar docs |
Same dot-product / cosine similarity intuition from Module 1 — higher dimension, same geometric idea.
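To illustrate the similarity search, here is a sketch that ranks a few document vectors against a query by cosine similarity. The vectors are hand-written three-dimensional toys, not the output of any real embedding model, and the helper name cosine_similarity is hypothetical.

```python
import numpy as np

def cosine_similarity(query, docs):
    """Cosine similarity between one query vector and each row of docs."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

# Made-up vectors; a real retrieval model would produce hundreds of dimensions.
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
])
query_vector = np.array([1.0, 0.2, 0.0])

scores = cosine_similarity(query_vector, doc_vectors)
ranking = np.argsort(-scores)  # document indices, most similar first
print(ranking)
```

In a RAG setup the same ranking step would typically run over precomputed chunk embeddings, often through a vector index rather than a full matrix product.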
Common mistakes
- Confusing tokenizer vocabulary with embedding dimension: vocab is the count of IDs; d is the vector length.
- Assuming longer text = more embedding dimensions: length is sequence length; each position still has dimension d.
- Using generation model embeddings for retrieval without testing: dedicated embedding models often win on search quality.