Vectorization — text to vectors

Before we begin

Tokenization maps text to integer IDs. Vectorization maps those IDs to dense vectors the transformer can compute with.

Vectorization is the step where discrete tokens become continuous numbers — the bridge between human language and matrix math.

Figure: Text → tokens → vectors. Each token ID is looked up in an embedding table to produce a vector.

Each token ID i maps to a learned vector eᵢ of dimension d (often 768–4096 in production).

The model stores an embedding table — a matrix of shape [vocab_size, d]. Row i is the vector for token i.
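A minimal sketch of the lookup, assuming a toy vocabulary of 8 IDs and d = 4. The random numbers only stand in for learned weights, and `embed` and the token IDs are hypothetical names and values chosen for illustration:

```python
import random

VOCAB_SIZE = 8   # toy value; real vocabularies have tens of thousands of IDs
D = 4            # toy value; real models use hundreds to thousands of dimensions

random.seed(0)
# Embedding table of shape [VOCAB_SIZE, D]; random numbers stand in for learned weights.
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Return one table row per token ID, giving a [seq_len, D] matrix."""
    return [embedding_table[i] for i in token_ids]

token_ids = [1, 5, 2, 5]               # hypothetical tokenizer output
vectors = embed(token_ids)
print(len(vectors), len(vectors[0]))   # 4 4  (seq_len, D)
```

Token ID 5 appears twice, and both occurrences get the same row. At this stage the vector depends only on the ID, not on where the token sits in the sequence.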

Why vectors, not one-hot?

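One-hot vectors are huge and sparse: one dimension per vocabulary entry (often 50k+), with a single nonzero value.
Dense embeddings are compact and encode semantic similarity: related tokens sit closer together in the vector space (same idea as word2vec in Module 4).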
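The two representations are closely related: multiplying a one-hot vector by the embedding table selects exactly one row, so a table lookup is the same operation without the wasted multiplications by zero. A small sketch with made-up table values (the helpers `one_hot` and `vec_mat` are illustrative, not from any library):

```python
VOCAB_SIZE = 5
D = 3
# Toy embedding table [VOCAB_SIZE, D]; values are made up for illustration.
table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

def one_hot(i, size):
    """Vector of length `size` with 1.0 at index i and 0.0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(size)]

def vec_mat(v, m):
    """Multiply row vector v (length n) by matrix m (n rows, k columns)."""
    return [sum(v[r] * m[r][c] for r in range(len(m))) for c in range(len(m[0]))]

token_id = 3
via_one_hot = vec_mat(one_hot(token_id, VOCAB_SIZE), table)
via_lookup = table[token_id]
print(via_one_hot == via_lookup)   # True
```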

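Self-attention on its own is permutation-equivariant: shuffling the input tokens just shuffles the output rows the same way. Without an extra position signal, "cat sat" and "sat cat" would produce the same set of vectors, so the model could not tell the two orders apart.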
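A small sketch can make this concrete. It assumes a stripped-down single-head attention with Q = K = V = input (no learned projections, no position signal), and the two token vectors are made up:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with Q = K = V = x (a simplification for illustration)."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this query against every row.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted average of all rows.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def close(u, v):
    return all(math.isclose(a, b) for a, b in zip(u, v))

cat = [1.0, 0.0, 0.5]    # made-up vector for "cat"
sat = [0.2, 0.9, -0.3]   # made-up vector for "sat"

cat_sat = self_attention([cat, sat])
sat_cat = self_attention([sat, cat])

# The output for "cat" is the same in both orders; only its row index changes.
print(close(cat_sat[0], sat_cat[1]))   # True
print(close(cat_sat[1], sat_cat[0]))   # True
```

Swapping the inputs only swaps the output rows, so nothing in the result records which token came first.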

Transformers inject position information so the model knows order. Sinusoidal and learned encodings are added to each token vector; RoPE instead rotates the query and key vectors inside attention.

Approach	Idea
Sinusoidal	Fixed math functions of position (original paper)
Learned	Trainable position vectors (common in GPT-style models)
RoPE	Rotate query/key vectors by position (Llama, many modern LLMs)

You do not need to implement RoPE — know that position is explicit, not inferred from token IDs alone.
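The sinusoidal variant is simple enough to sketch. The code below follows the formulation from the original Transformer paper (sine on even dimensions, cosine on odd ones, base 10000) and assumes an even d; the function names and the example vector are illustrative:

```python
import math

def sinusoidal_encoding(seq_len, d):
    """Return a [seq_len, d] list of fixed position vectors (d assumed even)."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d, 2):
            angle = pos / (10000 ** (i / d))
            row.append(math.sin(angle))   # even dimension i
            row.append(math.cos(angle))   # odd dimension i + 1
        pe.append(row)
    return pe

def add_positions(token_vectors):
    """Add the position vector for each index to the token vector at that index."""
    pe = sinusoidal_encoding(len(token_vectors), len(token_vectors[0]))
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(token_vectors, pe)]

same_token = [0.5, -0.5, 0.25, 0.0]   # made-up embedding, repeated at three positions
positioned = add_positions([same_token, same_token, same_token])
print(positioned[0])                    # [0.5, 0.5, 0.25, 1.0]
print(positioned[0] != positioned[1])   # True
```

The same token now yields a different vector at each position, which is the explicit order signal the attention layers need.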

Full pipeline for one forward pass:

1. Tokenize → [101, 4523, 892, 102]
2. Embed → matrix [seq_len, d]
3. Add position → same shape, position-aware
4. Self-attention → each row becomes a contextual vector blending other positions

Vectorization is steps 2–3; attention (Lessons 1–2) is step 4.
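The first three stages can be sketched end to end. This is a toy under stated assumptions: the five-word vocabulary, the whitespace tokenizer, and the table sizes are hypothetical, random numbers stand in for learned weights, and positions use a learned-style table as in GPT-style models:

```python
import random

random.seed(0)
VOCAB = {"<bos>": 0, "the": 1, "cat": 2, "sat": 3, "<eos>": 4}  # hypothetical toy vocabulary
D = 6          # toy embedding dimension
MAX_LEN = 16   # toy maximum sequence length

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_table = rand_matrix(len(VOCAB), D)    # [vocab_size, D]
position_table = rand_matrix(MAX_LEN, D)    # [MAX_LEN, D], one row per position

def tokenize(text):
    # Tokenize: a whitespace split stands in for a real subword tokenizer.
    return [VOCAB["<bos>"]] + [VOCAB[w] for w in text.split()] + [VOCAB["<eos>"]]

def vectorize(token_ids):
    # Embed: look up each token's row. Add position: add the row for its index.
    return [
        [t + p for t, p in zip(token_table[tid], position_table[pos])]
        for pos, tid in enumerate(token_ids)
    ]

short = vectorize(tokenize("cat sat"))
longer = vectorize(tokenize("the cat sat"))
print(len(short), len(short[0]))     # 4 6
print(len(longer), len(longer[0]))   # 5 6
```

Longer text adds rows (sequence length) while every row keeps the same width D.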

For RAG (Module 7), a separate model often maps whole sentences to one vector for search — not the same table as inside the LLM.

	In-LLM token embedding	Retrieval / sentence embedding
Granularity	Per token	Per chunk or sentence
Model	Part of the LLM's weights	e.g. `text-embedding-3-small`
Use	Generation	Find similar docs

Same dot-product / cosine similarity intuition from Module 1 — higher dimension, same geometric idea.
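As a reminder of that intuition, here is a minimal cosine-similarity ranking. The 4-dimensional vectors are made up and only stand in for real sentence embeddings, which a retrieval model would produce:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors standing in for sentence embeddings.
query = [0.9, 0.1, 0.0, 0.3]
docs = {
    "doc_a": [0.8, 0.2, 0.1, 0.4],
    "doc_b": [-0.5, 0.9, 0.3, 0.0],
}

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)   # ['doc_a', 'doc_b']
```

The same function works unchanged at 768 or 4096 dimensions; only the length of the lists grows.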

Confusing tokenizer vocabulary size with embedding dimension: vocabulary size is the number of token IDs; d is the length of each vector.
Assuming longer text means more embedding dimensions: longer text increases the sequence length; each position still has dimension d.
Using a generation model's embeddings for retrieval without testing: dedicated embedding models often win on search quality.