Word embeddings — language as vectors

Before we begin

Models cannot multiply words directly; they need numbers. One-hot vectors are huge and treat every word as unrelated. Embeddings map each word to a dense vector (typically 100–300 dimensions) in which words with similar meanings sit close together.

What is an embedding? A learned address for a word in meaning-space — the foundation of modern language AI.

Figure: Similar words cluster. Positive and negative words form separate regions in vector space.

What you will learn

Contrast one-hot vectors with embedding lookup.
Explain why embeddings matter for GenAI.
Choose between pre-trained embeddings and training from scratch when data is small.

Before this lesson

Lesson 3 — LSTM & GRU
Module 2 — Spam project (bag-of-words / TF-IDF baseline)

One-hot vs embedding

One-hot for vocabulary size V:

Vector length V, almost all zeros
“King” is exactly as far from “queen” as it is from “car”
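
To make the distance claim concrete, here is a minimal sketch using a made-up five-word vocabulary (the words and their order are assumptions for illustration). Any two distinct one-hot vectors differ in exactly two positions, so their Euclidean distance is always √2.

```python
import math

# Toy vocabulary (an assumption for illustration; real vocabularies are far larger).
vocab = ["king", "queen", "car", "good", "terrible"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a length-V list with a single 1.0 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_king_queen = euclidean(one_hot("king"), one_hot("queen"))
d_king_car = euclidean(one_hot("king"), one_hot("car"))

# The two distances are identical: one-hot vectors carry no notion of similarity.
print(d_king_queen, d_king_car)
```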

Embedding layer in PyTorch:

Matrix E of shape [V, d]
Word index i → row E[i] = vector of length d
Entries of E are learned during training (or copied from pre-trained vectors); d is a hyperparameter you choose
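
The lookup can be sketched without any framework. The sizes V = 5 and d = 3 and the random starting values below are assumptions for illustration. In PyTorch, torch.nn.Embedding(V, d) holds a matrix of this shape as its weight and performs the same row lookup.

```python
import random

V, d = 5, 3  # vocabulary size and embedding dimension (toy values)
rng = random.Random(0)

# Embedding matrix E of shape [V, d], randomly initialised before training.
E = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(V)]

def embed(index):
    """Embedding lookup: word index i -> row E[i]."""
    return E[index]

def one_hot_times_matrix(index):
    """The same result the slow way: a one-hot row vector (length V) times E."""
    one_hot = [1.0 if j == index else 0.0 for j in range(V)]
    return [sum(one_hot[j] * E[j][k] for j in range(V)) for k in range(d)]

i = 2
looked_up = embed(i)
multiplied = one_hot_times_matrix(i)

# The lookup returns a vector of length d and matches the matrix product.
print(len(looked_up), looked_up == multiplied)
```

The lookup skips the V multiplications per output value, which is why frameworks store word indices rather than one-hot vectors.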

How embeddings learn

Training on a task such as sentiment classification or language modeling adjusts the vectors so that words appearing in similar contexts end up closer together (the intuition behind Word2Vec and GloVe).
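
The "similar contexts" idea can be illustrated without a neural network. The sketch below is a simplified count-based stand-in, not Word2Vec or GloVe: it counts which words occur within one position of each target word in a tiny made-up corpus, then compares the count vectors with cosine similarity.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration).
corpus = [
    "the movie was good",
    "the movie was great",
    "the film was good",
    "the film was great",
    "the car was red",
    "the car was fast",
]

window = 1  # how many neighbours on each side count as context
context_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                context_counts[word][tokens[ctx_pos]] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# "good" and "great" share their contexts; "good" and "the" share none.
sim_good_great = cosine(context_counts["good"], context_counts["great"])
sim_good_the = cosine(context_counts["good"], context_counts["the"])
print(sim_good_great, sim_good_the)
```

Learned embeddings compress this kind of co-occurrence signal into d dense dimensions instead of one count per context word.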

Examples often cited after training on large corpora:

king − man + woman ≈ queen (direction of gender)
good close to great, far from terrible
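
Both examples can be reproduced on hand-picked toy vectors. The numbers below are invented so that the axes loosely mean royalty, gender, and sentiment; they are not trained embeddings, and the helper name analogy is hypothetical. Real embeddings show the same pattern only approximately.

```python
import math

# Hand-picked toy vectors, NOT trained embeddings (assumption for illustration).
# Axes are loosely: [royalty, gender, sentiment].
vectors = {
    "king": [0.9, 0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man": [0.1, 0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "good": [0.0, 0.0, 0.7],
    "great": [0.0, 0.0, 0.9],
    "terrible": [0.0, 0.0, -0.9],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word nearest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
sim_good_great = cosine(vectors["good"], vectors["great"])
sim_good_terrible = cosine(vectors["good"], vectors["terrible"])
print(result, sim_good_great > sim_good_terrible)
```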

You do not need to derive Word2Vec math here — know the outcome: geometry encodes meaning.

Pre-trained vs train on your reviews

Approach	When it helps
Pre-trained (GloVe, fastText, public weights)	Small dataset, rare words, faster convergence
Train embedding layer from scratch	Domain-specific jargon, enough data
Fine-tune pre-trained	Middle ground — start general, adapt to products
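
One way to picture the three rows: they differ in how the matrix E is initialised and whether it is updated afterwards. In the sketch below, the pretrained dictionary is a made-up stand-in for vectors loaded from a GloVe or fastText file, and build_embedding_matrix is a hypothetical helper, not a library function.

```python
import random

# Stand-in for vectors loaded from a pre-trained file (values are made up).
pretrained = {
    "good": [0.2, 0.7, -0.1],
    "great": [0.3, 0.8, -0.2],
    "terrible": [-0.4, -0.9, 0.1],
}

# "unboxing" plays the role of domain jargon missing from the pre-trained set.
vocab = ["<pad>", "good", "great", "terrible", "unboxing"]
d = 3

def build_embedding_matrix(vocab, pretrained, d, seed=0):
    """Copy pre-trained rows where available; use small random rows otherwise."""
    rng = random.Random(seed)
    matrix, missing = [], []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(d)])
            missing.append(word)
    return matrix, missing

E, missing = build_embedding_matrix(vocab, pretrained, d)
coverage = 1 - len(missing) / len(vocab)
print(f"pre-trained coverage: {coverage:.0%}, random rows: {missing}")
```

Keeping E fixed corresponds to the pre-trained row, letting the optimizer update it corresponds to fine-tuning, and passing an empty dictionary gives the from-scratch case. In PyTorch, torch.nn.Embedding.from_pretrained accepts a freeze argument that switches between the frozen and fine-tuned variants.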

Module 4 project: try pre-trained and from-scratch embeddings and compare validation F1.
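
For that comparison, F1 is the harmonic mean of precision and recall on the positive class. A minimal sketch follows; the labels and the two prediction lists are made up for illustration and are not results from any real model.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class, computed from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical validation labels and predictions from two hypothetical models.
y_val = [1, 1, 1, 0, 0, 0, 1, 0]
pred_a = [1, 1, 0, 0, 0, 0, 1, 0]
pred_b = [1, 0, 0, 0, 1, 0, 1, 1]

f1_a = f1_score(y_val, pred_a)
f1_b = f1_score(y_val, pred_b)
print(f"model A F1: {f1_a:.3f}, model B F1: {f1_b:.3f}")
```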

Embeddings and GenAI

LLMs use token embeddings at the input: the same idea at billion-parameter scale. Contextual embeddings (BERT, GPT layers) change a token's vector based on the surrounding words, whereas static embeddings such as Word2Vec assign each word a single vector regardless of context.
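
The static versus contextual difference can be shown with a deliberately crude toy. The function toy_contextual_embed below simply averages each token's vector with its neighbours' vectors; it is a hypothetical stand-in for illustration, not how BERT or GPT compute their representations, and the vectors are made up.

```python
# Made-up static vectors: one per word, regardless of sentence.
static = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "loan": [0.0, 1.0],
}

def static_embed(tokens):
    """Static lookup: each token always maps to the same vector."""
    return [static[t] for t in tokens]

def toy_contextual_embed(tokens):
    """Crude stand-in for a contextual model: average each token with its neighbours."""
    vecs = static_embed(tokens)
    out = []
    for i in range(len(vecs)):
        window = vecs[max(0, i - 1): i + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

s1 = ["river", "bank"]
s2 = ["bank", "loan"]

# "bank" is at position 1 in s1 and position 0 in s2.
static_same = static_embed(s1)[1] == static_embed(s2)[0]
contextual_same = toy_contextual_embed(s1)[1] == toy_contextual_embed(s2)[0]
print(static_same, contextual_same)
```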

Understanding static embeddings makes Module 6 transformers easier.

Simple baselines before LSTM

Average word embeddings → logistic regression
TF-IDF + linear classifier (Module 2 style)
LSTM on embedding sequence (project)

Compare on the same train/val/test split.
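
The first baseline and the shared split can be sketched as follows. The embeddings, the reviews, and the helper names average_embedding and split_indices are all assumptions for illustration; a fixed seed is one simple way to make every model see the same train/val/test rows.

```python
import random

# Made-up 2-d embeddings (assumption for illustration).
embeddings = {
    "good": [0.8, 0.1],
    "great": [0.9, 0.2],
    "terrible": [-0.9, 0.1],
    "movie": [0.0, 0.7],
}
d = 2

def average_embedding(text):
    """Mean of the vectors of known words; zero vector if no word is known."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * d
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def split_indices(n, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle indices once with a fixed seed so every model uses the same split."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

reviews = ["good movie", "terrible movie", "great great movie", "unknown words only", "good"]
features = [average_embedding(r) for r in reviews]  # one fixed-length vector per review
train_idx, val_idx, test_idx = split_indices(len(reviews))
print(features[0], len(train_idx), len(val_idx), len(test_idx))
```

Each row of features is a fixed-length input that a logistic regression can consume directly, whatever the review length.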

Checkpoint

What is an embedding?

Answer sketch

A dense vector representation of a word (or token) in which semantically similar words lie close together in vector space.

What's next

Module 4 quiz