Module 6 — Transformers (core of GenAI)

Vectorization — text to vectors

Token embeddings, positional encodings, sentence embeddings for retrieval, and how vectorization connects tokenization to attention.

~70 min read + exercises

Before we begin

Tokenization maps text to integer IDs. Vectorization maps those IDs to dense vectors the transformer can compute with.

Vectorization is the step where discrete tokens become continuous numbers — the bridge between human language and matrix math.

Figure: Text → tokens → vectors

Text (string) → Tokenizer (BPE) → Token IDs [42, 7…] → Embeddings (vectors)

Each token ID is looked up in an embedding table to produce a vector.

What you will learn

  • Explain token embeddings inside an LLM.
  • Describe positional encoding and why order matters.
  • Distinguish in-model embeddings from retrieval embeddings (RAG).
  • Connect vectorization to attention (Module 6 Lessons 1–2).


Token embeddings

Each token ID i maps to a learned vector eᵢ of dimension d (often 768–4096 in production).

The model stores an embedding table — a matrix of shape [vocab_size, d]. Row i is the vector for token i.
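The lookup can be sketched in a few lines of plain Python. This is a toy illustration, not how a real model stores its weights: `vocab_size`, `d`, the random initial values, and the helper name `embed` are all assumptions chosen for readability.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

vocab_size = 10  # hypothetical tiny vocabulary
d = 4            # hypothetical embedding dimension

# Embedding table of shape [vocab_size, d]; a trained model learns these values.
embedding_table = [[random.gauss(0.0, 0.02) for _ in range(d)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Return one table row per token ID: a matrix of shape [seq_len, d]."""
    return [embedding_table[i] for i in token_ids]

token_ids = [3, 7, 3]
vectors = embed(token_ids)

print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same ID, same vector
```

At this stage the same token ID always yields the same vector, wherever it appears in the sequence.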

Why vectors, not one-hot?

  • One-hot vectors are huge and sparse (vocab size 50k+).
  • Dense embeddings pack semantic similarity — related tokens sit closer in space (same idea as Module 4 word2vec).
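One way to see the relationship between the two representations: multiplying a one-hot vector by the embedding table selects exactly one row, so the lookup gives the same result without ever building the sparse vector. The sketch below uses a hypothetical 5-token vocabulary and made-up table values.

```python
vocab_size, d = 5, 3  # hypothetical sizes

# Toy embedding table: row i is [i, i + 0.1, i + 0.2], purely for illustration.
table = [[float(i), i + 0.1, i + 0.2] for i in range(vocab_size)]

def one_hot(i, size):
    """Vector of zeros with a single 1.0 at index i."""
    return [1.0 if j == i else 0.0 for j in range(size)]

def vec_mat(v, m):
    """Multiply a row vector of length n by an n x k matrix."""
    return [sum(v[r] * m[r][c] for r in range(len(m)))
            for c in range(len(m[0]))]

token_id = 2
via_matmul = vec_mat(one_hot(token_id, vocab_size), table)
via_lookup = table[token_id]

same = all(abs(a - b) < 1e-12 for a, b in zip(via_matmul, via_lookup))
print(same)  # True
```

With a 50k-entry vocabulary the one-hot route multiplies by tens of thousands of zeros per token, which is why implementations index into the table directly.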

Positional information

Self-attention on its own is permutation-equivariant: without an extra signal, reordering the input tokens only reorders the outputs, so "cat sat" and "sat cat" would produce the same vector for each word.
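A small experiment makes this concrete. The sketch below is a simplified single-head attention with identity projections (Q = K = V = input), which is an assumption made to keep the example short; the two 2-d "embeddings" are invented.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with identity projections (Q = K = V = x)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out

def close(u, v):
    return all(abs(a - b) < 1e-9 for a, b in zip(u, v))

cat = [1.0, 0.0]  # hypothetical embedding for "cat"
sat = [0.0, 1.0]  # hypothetical embedding for "sat"

out_cat_sat = self_attention([cat, sat])
out_sat_cat = self_attention([sat, cat])

# The output for "cat" is identical whether it comes first or second.
print(close(out_cat_sat[0], out_sat_cat[1]))  # True
```

Nothing in the computation depends on a token's index, so word order leaves no trace in the result.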

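Transformers therefore inject positional information so the model knows order. The original design adds a positional encoding (sinusoidal or learned) to each token vector; RoPE instead applies position inside the attention computation.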

Approach      Idea
Sinusoidal    Fixed math functions of position (original paper)
Learned       Trainable position vectors (common in GPT-style models)
RoPE          Rotate query/key vectors by position (Llama, many modern LLMs)

You do not need to implement RoPE — know that position is explicit, not inferred from token IDs alone.
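For the sinusoidal approach, a minimal sketch follows. It uses the formula from the original Transformer paper, where even dimensions hold sin(pos / 10000^(2i/d)) and odd dimensions hold the matching cosine. The function name and the assumption that `d` is even are choices made for this example.

```python
import math

def sinusoidal_encoding(seq_len, d):
    """Return a [seq_len, d] list of position vectors (d assumed even)."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d))
            row.append(math.sin(angle))  # dimension i
            row.append(math.cos(angle))  # dimension i + 1
        pe.append(row)
    return pe

pe = sinusoidal_encoding(seq_len=4, d=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Adding position to a (made-up) token vector keeps the shape unchanged.
token_vector = [0.5, -0.2, 0.1, 0.0, 0.3, -0.4]
position_aware = [t + p for t, p in zip(token_vector, pe[2])]
print(len(position_aware))  # 6
```

Each position receives a distinct vector, so the same token at two positions no longer looks identical to the attention layers.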


From tokens to attention input

Full pipeline for one forward pass:

  1. Tokenize → [101, 4523, 892, 102]
  2. Embed → matrix [seq_len, d]
  3. Add position → same shape, position-aware
  4. Self-attention → each row becomes a contextual vector blending other positions

Vectorization is steps 2–3; attention (Lessons 1–2) is step 4.
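The four steps can be traced end to end in a toy script. Everything numeric here is an assumption for illustration: the table is random rather than trained, positions use the sinusoidal scheme, and attention is a single head with identity projections instead of learned Q/K/V matrices.

```python
import math
import random

random.seed(0)           # reproducible toy weights
vocab_size, d = 5000, 8  # hypothetical sizes

# Step 1: token IDs, as a tokenizer would produce them.
token_ids = [101, 4523, 892, 102]

# Step 2: embedding lookup -> matrix of shape [seq_len, d].
table = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(vocab_size)]
embedded = [table[i] for i in token_ids]

# Step 3: add a sinusoidal position vector to each row (shape unchanged).
def position_vector(pos, d):
    row = []
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        row += [math.sin(angle), math.cos(angle)]
    return row

positioned = [[e + p for e, p in zip(row, position_vector(pos, d))]
              for pos, row in enumerate(embedded)]

# Step 4: single-head self-attention with identity projections (Q = K = V).
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    dim = len(x[0])
    outputs, all_weights = [], []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, x))
                        for j in range(dim)])
    return outputs, all_weights

contextual, attn = self_attention(positioned)
print(len(contextual), len(contextual[0]))  # 4 8
```

The shape stays [seq_len, d] throughout; only the content of each row changes, from a context-free lookup to a blend of the whole sequence.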


Sentence embeddings (retrieval)

For RAG (Module 7), a separate model often maps whole sentences to one vector for search — not the same table as inside the LLM.

               In-LLM token embedding    Retrieval / sentence embedding
Granularity    Per token                 Per chunk or sentence
Model          Part of GPT weights       e.g. text-embedding-3-small
Use            Generation                Find similar docs

The dot-product / cosine similarity intuition from Module 1 carries over: the vectors have more dimensions, but the geometric idea is the same.
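As a sketch of how retrieval uses that geometry, the example below ranks three documents against a query by cosine similarity. The 3-d vectors are hand-written stand-ins for what an embedding model would return; no real model or API is called, and the texts and values are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for an embedding model's output.
docs = {
    "Cats sleep most of the day.": [0.9, 0.1, 0.0],
    "Transformers use self-attention.": [0.0, 0.2, 0.9],
    "Kittens nap often.": [0.8, 0.3, 0.1],
}

# Stand-in for the embedded query "How much do cats sleep?"
query_vector = [1.0, 0.2, 0.0]

ranked = sorted(docs, key=lambda text: cosine(query_vector, docs[text]),
                reverse=True)
print(ranked[0])  # Cats sleep most of the day.
```

In a real RAG system the vectors have hundreds or thousands of dimensions and the search usually runs inside a vector index, but the ranking rule is the same.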


Common mistakes

  • Confusing tokenizer vocabulary with embedding dimension — vocab is count of IDs; d is vector length.
  • Assuming longer text = more embedding dimensions — length is sequence length; each position still has dimension d.
  • Using generation model embeddings for retrieval without testing — dedicated embedding models often win on search quality.
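The first two mistakes come down to mixing up three separate numbers. A quick shape check, using hypothetical sizes and a zero-filled placeholder table, keeps them apart:

```python
vocab_size = 100  # number of distinct token IDs (hypothetical)
d = 4             # length of each embedding vector (hypothetical)

# Placeholder table of shape [vocab_size, d]; values do not matter here.
table = [[0.0] * d for _ in range(vocab_size)]

def embed(token_ids):
    return [table[i] for i in token_ids]

short = embed([5, 9])
long = embed([5, 9, 12, 40, 7, 3])

for name, matrix in [("short", short), ("long", long)]:
    print(name, "seq_len =", len(matrix), "d =", len(matrix[0]))
# short seq_len = 2 d = 4
# long seq_len = 6 d = 4
```

Longer text adds rows, not columns: `seq_len` grows while `d` and `vocab_size` stay fixed.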

What's next

Module 6 quiz