Word embeddings — language as vectors
Before we begin
Models cannot multiply words directly — they need numbers. One-hot vectors are huge and treat every word as unrelated. Embeddings map each word to a dense vector (e.g. 100–300 dimensions) where similar meaning → nearby points.
What is an embedding? A learned address for a word in meaning-space — the foundation of modern language AI.
Figure: Similar words cluster
What you will learn
- Contrast one-hot vs embedding lookup.
- Explain why embeddings matter for GenAI.
- Choose between pre-trained embeddings and training from scratch on small data.
Before this lesson
- Lesson 3 — LSTM & GRU
- Module 2 — Spam project (bag-of-words / TF-IDF baseline)
One-hot vs embedding
One-hot for vocabulary size V:
- Vector length V, almost all zeros
- “king” and “queen” are exactly as far apart as “king” and “car”
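To make the equal-distance point concrete, here is a minimal sketch using only the Python standard library. The four-word vocabulary and the helper names `one_hot` and `euclidean` are assumptions chosen for illustration.

```python
import math

# Toy vocabulary (an assumption for illustration), so V = 4.
vocab = ["king", "queen", "car", "apple"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a length-V list with a single 1.0 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_king_queen = euclidean(one_hot("king"), one_hot("queen"))
d_king_car = euclidean(one_hot("king"), one_hot("car"))

# Any two different one-hot vectors differ in exactly two positions,
# so every pair of distinct words is sqrt(2) apart.
print(d_king_queen, d_king_car)
```

Because every pair of distinct words ends up the same distance apart, a one-hot encoding carries no information about which words are related.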
Embedding layer in PyTorch:
- Matrix E of shape [V, d]
- Word index i → row E[i] = vector of length d
- The values in E are learned during training (or copied from pre-trained vectors); d itself is a hyperparameter you choose
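The lookup described above can be sketched in plain Python. This is an illustration, not PyTorch code: the sizes `V = 5` and `d = 3` are toy assumptions, and random numbers stand in for learned values. PyTorch's `torch.nn.Embedding` stores a matrix of this shape as its weight and performs the same row selection.

```python
import random

random.seed(0)
V, d = 5, 3  # toy vocabulary size and embedding dimension (assumptions)

# Embedding matrix E with shape [V, d]; random values stand in for learned ones.
E = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]

def lookup(i):
    """Embedding lookup: simply return row i of E."""
    return E[i]

def one_hot_times_E(i):
    """Same result computed as (one-hot row vector) x E, with far more work."""
    one_hot = [1.0 if j == i else 0.0 for j in range(V)]
    return [sum(one_hot[j] * E[j][k] for j in range(V)) for k in range(d)]

i = 2
vec = lookup(i)
via_matmul = one_hot_times_E(i)
print(vec)
print(via_matmul)
```

Both paths produce the same length-d vector, which is why an embedding layer can be read as a one-hot multiplication implemented as a cheap table lookup.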
How embeddings learn
Training on a task such as sentiment classification or language modeling pulls vectors so that words appearing in similar contexts move closer together (the intuition behind Word2Vec and GloVe).
Examples often cited after training on large corpora:
- king − man + woman ≈ queen (direction of gender)
- good close to great, far from terrible
You do not need to derive Word2Vec math here — know the outcome: geometry encodes meaning.
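The analogy arithmetic can be tried on a toy example. The 2-D vectors below are hand-made assumptions (one axis loosely "royalty", the other "gender"), not trained embeddings, and the `analogy` helper is a hypothetical name. Real embeddings have many more dimensions and the match is only approximate.

```python
import math

# Hand-made 2-D toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# Illustrative assumptions only, not output from a trained model.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1, 1.0],
    "woman": [0.1, -1.0],
    "car":   [-0.8, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three query words from the candidates is the usual convention for this kind of nearest-neighbor analogy search.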
Pre-trained vs train on your reviews
| Approach | When it helps |
|---|---|
| Pre-trained (GloVe, fastText, public weights) | Small dataset, rare words, faster convergence |
| Train embedding layer from scratch | Domain-specific jargon, enough data |
| Fine-tune pre-trained | Middle ground — start general, adapt to products |
Module 4 project: try both and compare validation F1.
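A common first step for the pre-trained and fine-tune rows is to build the embedding matrix for your own vocabulary from existing vectors. The sketch below assumes a tiny hypothetical `pretrained` dictionary and toy vocabulary. In practice the vectors would be loaded from a GloVe or fastText file, and the function name `build_matrix` is made up for this example.

```python
import random

random.seed(0)
d = 4  # embedding dimension (toy value)

# Hypothetical pre-trained vectors; real ones would come from a file.
pretrained = {
    "good":  [0.5, 0.1, 0.0, 0.2],
    "great": [0.6, 0.1, 0.1, 0.2],
}

# Vocabulary built from your own reviews (toy example).
vocab = ["<pad>", "good", "great", "unboxing"]

def build_matrix(vocab, pretrained, d):
    """Copy pre-trained rows where available; random-initialise the rest."""
    matrix, copied = [], []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
            copied.append(True)
        else:
            matrix.append([random.uniform(-0.1, 0.1) for _ in range(d)])
            copied.append(False)
    return matrix, copied

E, copied = build_matrix(vocab, pretrained, d)
coverage = sum(copied) / len(vocab)  # fraction of vocabulary found in pretrained
print(f"coverage: {coverage:.0%}")
```

Whether the copied rows stay fixed or keep updating during training is what separates the pre-trained row of the table from the fine-tune row. In PyTorch, `torch.nn.Embedding.from_pretrained` exposes this choice through its `freeze` argument.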
Embeddings and GenAI
LLMs use token embeddings at the input: the same idea at billion-parameter scale. Contextual embeddings (BERT, GPT layers) change a token's vector based on the surrounding words, whereas a static Word2Vec embedding assigns each word one fixed vector regardless of context.
Understanding static embeddings makes Module 6 transformers easier.
Simple baselines before LSTM
- Average word embeddings → logistic regression
- TF-IDF + linear classifier (Module 2 style)
- LSTM on embedding sequence (project)
Compare on the same train/val/test split.
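The first baseline turns a whole review into one fixed-length feature vector by averaging its word vectors. A minimal sketch follows, assuming hand-made 3-D toy embeddings and a hypothetical helper `average_embedding`. The resulting vector is what would be passed to a logistic regression classifier.

```python
d = 3  # embedding dimension (toy value)

# Hand-made toy embeddings, for illustration only.
embeddings = {
    "great":    [0.9, 0.1, 0.0],
    "phone":    [0.0, 0.8, 0.2],
    "terrible": [-0.9, 0.1, 0.0],
}

def average_embedding(tokens, embeddings, d):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * d
    return [sum(vec[k] for vec in known) / len(known) for k in range(d)]

# "zzz" is out of vocabulary and is skipped.
features = average_embedding(["great", "phone", "zzz"], embeddings, d)
print(features)
```

Averaging discards word order, which is the information the LSTM is meant to recover, so the gap between this baseline and the LSTM shows how much order matters on your data.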
Checkpoint
What is an embedding?
Answer sketch
A dense vector representation of a word (or token) in which semantically similar words lie close together in vector space.