Module 4 — Deep learning architectures

Word embeddings — language as vectors

From one-hot to dense vectors, similarity in embedding space, pre-trained vs trained-from-scratch, and GenAI relevance.

~70 min read + exercises

Before we begin

Models cannot multiply words directly; they need numbers. One-hot vectors are as long as the vocabulary and treat every pair of words as equally unrelated. Embeddings map each word to a dense vector (e.g. 100–300 dimensions) in which words with similar meanings land at nearby points.

What is an embedding? A learned address for a word in meaning-space — the foundation of modern language AI.

Figure: Similar words cluster. Similar words map to nearby vectors: "good", "great", "excellent" in one group; "bad", "terrible", "awful" in another. Positive and negative words form separate regions in vector space.

What you will learn

  • Contrast one-hot vs embedding lookup.
  • Explain why embeddings matter for GenAI.
  • Choose between pre-trained embeddings and training from scratch, especially on small data.


One-hot vs embedding

One-hot for vocabulary size V:

  • Vector length V, almost all zeros
  • “King” and “queen” are exactly as far apart as “king” and “car”
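
To make the second bullet concrete, here is a minimal sketch using only the Python standard library. The four-word vocabulary and the helper names (`one_hot`, `euclidean`, `dot`) are invented for illustration; a real vocabulary would have tens of thousands of entries.

```python
import math

def one_hot(index, vocab_size):
    """Return a vector of length vocab_size with a single 1.0 at `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    """Dot product; 0.0 means the vectors share no direction at all."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vocabulary (assumption, not from the lesson).
vocab = {"king": 0, "queen": 1, "car": 2, "good": 3}
V = len(vocab)

king = one_hot(vocab["king"], V)
queen = one_hot(vocab["queen"], V)
car = one_hot(vocab["car"], V)

d_king_queen = euclidean(king, queen)
d_king_car = euclidean(king, car)

print(d_king_queen == d_king_car)         # True: both distances are sqrt(2)
print(dot(king, queen), dot(king, car))   # 0.0 0.0
```

Every pair of distinct one-hot vectors has the same distance and a dot product of zero, so the representation carries no notion of similarity.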

Embedding layer in PyTorch:

  • Matrix E of shape [V, d]
  • Word index i → row E[i] = vector of length d
  • Entries of E are learned during training (or copied from pre-trained vectors); the dimension d is a hyperparameter you choose
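
The lookup described above can be sketched in plain Python. This is an illustration under assumed toy sizes (V = 5, d = 3), not PyTorch code; the function names are hypothetical. It shows that picking row i of E gives the same result as multiplying a one-hot vector by E, only without the wasted multiplications by zero.

```python
import random

random.seed(0)

V, d = 5, 3  # toy sizes (assumption); d is typically in the hundreds

# Embedding matrix E of shape [V, d], randomly initialised as it would be
# before training.
E = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(V)]

def lookup(E, i):
    """Embedding lookup: return row i of E."""
    return E[i]

def one_hot_times_matrix(E, i):
    """The same result the slow way: multiply a one-hot row vector by E."""
    vocab_size, dim = len(E), len(E[0])
    one_hot = [1.0 if k == i else 0.0 for k in range(vocab_size)]
    return [sum(one_hot[k] * E[k][j] for k in range(vocab_size))
            for j in range(dim)]

i = 2
vec_fast = lookup(E, i)
vec_slow = one_hot_times_matrix(E, i)

print(len(vec_fast))  # 3, i.e. a vector of length d
print(all(abs(a - b) < 1e-12 for a, b in zip(vec_fast, vec_slow)))  # True
```

In PyTorch, `torch.nn.Embedding` stores such a matrix as its weight and performs this row lookup for a tensor of indices.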

How embeddings learn

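Training on a task such as sentiment classification or language modeling adjusts the vectors by gradient descent. With context-prediction objectives, words that appear in similar contexts are pulled closer together (the intuition behind Word2Vec and GloVe).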

Examples often cited after training on large corpora:

  • king − man + woman ≈ queen (direction of gender)
  • good close to great, far from terrible

You do not need to derive Word2Vec math here — know the outcome: geometry encodes meaning.
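
A small sketch of what "geometry encodes meaning" looks like in code. The 3-dimensional vectors below are hand-made for illustration (roughly: royalty, gender, vehicle) and are not trained; real embeddings come from a corpus, and the analogy holds only approximately there. The `analogy` helper is a hypothetical name, and excluding the query words from the candidates is a common convention assumed here.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption, NOT trained embeddings).
emb = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.7, 0.1],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "car":   [0.0,  0.0, 1.0],
}

def analogy(a, b, c, emb):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, emb[w]))

answer = analogy("king", "man", "woman", emb)
print(answer)  # queen
```

Cosine similarity is used instead of raw distance because it compares direction and ignores vector length, which is the usual choice for comparing word vectors.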


Pre-trained vs train on your reviews

Approach                                        When it helps
Pre-trained (GloVe, fastText, public weights)   Small dataset, rare words, faster convergence
Train embedding layer from scratch              Domain-specific jargon, enough data
Fine-tune pre-trained                           Middle ground: start general, adapt to products
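
One way the pre-trained and fine-tune options are often set up is to build the embedding matrix from pre-trained vectors and fall back to small random values for words the pre-trained set does not cover. The sketch below assumes this approach; the function name, the tiny 3-dimensional "pre-trained" dictionary, and the word "unboxing" as a domain-specific term are all invented for illustration and do not come from real GloVe or fastText files.

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return (matrix, n_found): one row per vocab word, in vocab order.

    Words present in `pretrained` copy their vector; the rest get small
    random values so training can still learn them.
    """
    rng = random.Random(seed)
    matrix, n_found = [], 0
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
            n_found += 1
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, n_found

# Made-up stand-in for a pre-trained vector file (assumption).
pretrained = {
    "good":  [0.8, 0.1, 0.0],
    "bad":   [-0.8, 0.1, 0.0],
    "great": [0.9, 0.2, 0.0],
}
vocab = ["good", "bad", "great", "unboxing"]  # last word is not covered

E, n_found = build_embedding_matrix(vocab, pretrained, dim=3)
coverage = n_found / len(vocab)
print(coverage)  # 0.75
```

Low coverage is a hint that the domain vocabulary differs from the pre-training corpus, which favours fine-tuning or training from scratch. Whether the copied rows stay fixed or keep updating is the difference between the first and third approaches; in PyTorch, `torch.nn.Embedding.from_pretrained` exposes this through its `freeze` argument.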

Module 4 project: try both and compare validation F1.


Embeddings and GenAI

LLMs use token embeddings at the input: the same idea at billion-parameter scale. Static embeddings such as Word2Vec assign one vector per word regardless of context, whereas contextual embeddings (the hidden states of BERT or GPT layers) change a token's vector based on the surrounding words.
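
The static versus contextual distinction can be illustrated with a deliberately simplified toy. Averaging a word's vector with its neighbours, as below, is only a stand-in chosen for this sketch; it is not how BERT or GPT compute contextual vectors (they use stacked attention layers). The 2-dimensional vectors and function names are made up.

```python
# Made-up static vectors (assumption): one fixed vector per word.
static = {
    "river": [0.0, 1.0],
    "bank":  [0.5, 0.5],
    "loan":  [1.0, 0.0],
}

def static_vectors(tokens):
    """Static lookup: each word always gets the same vector."""
    return [static[t] for t in tokens]

def toy_contextual_vectors(tokens, window=1):
    """Toy stand-in for context mixing: mean of the static vectors in a window."""
    base = static_vectors(tokens)
    out = []
    for pos in range(len(base)):
        lo, hi = max(0, pos - window), min(len(base), pos + window + 1)
        neighbours = base[lo:hi]
        out.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return out

s1 = ["river", "bank"]
s2 = ["bank", "loan"]

static_same = static_vectors(s1)[1] == static_vectors(s2)[0]
ctx1 = toy_contextual_vectors(s1)[1]  # "bank" next to "river"
ctx2 = toy_contextual_vectors(s2)[0]  # "bank" next to "loan"

print(static_same)  # True: the static vector ignores context
print(ctx1, ctx2)   # [0.25, 0.75] [0.75, 0.25]
```

The point is only that the same word ends up with different vectors in different sentences once context is mixed in.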

Understanding static embeddings makes Module 6 transformers easier.


Simple baselines before LSTM

  1. Average word embeddings → logistic regression
  2. TF-IDF + linear classifier (Module 2 style)
  3. LSTM on embedding sequence (project)

Compare on the same train/val/test split.
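
The feature step of baseline 1 can be sketched as follows: average the vectors of the words in a review to get one fixed-length vector, which a logistic regression can then take as input. The tiny embedding table and the `average_embedding` helper are assumptions for illustration; skipping unknown words and returning zeros for an empty review are design choices made here, not requirements.

```python
def average_embedding(tokens, embeddings, dim):
    """Mean of the vectors for known tokens; zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Made-up 2-d vectors (assumption), standing in for trained embeddings.
embeddings = {
    "good":     [1.0, 0.0],
    "great":    [0.8, 0.2],
    "terrible": [-1.0, 0.0],
    "phone":    [0.0, 1.0],
}

features = average_embedding("great phone".split(), embeddings, dim=2)
unknown = average_embedding(["zzz"], embeddings, dim=2)

print(len(features))  # 2: the length is d regardless of review length
print(unknown)        # [0.0, 0.0]
```

Averaging discards word order, which is the main thing the LSTM in item 3 can add; comparing the two on the same split shows how much order matters for your data.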


Checkpoint

What is an embedding?

Answer sketch

A dense vector representation of a word (or token) in which semantically similar words lie close together in vector space.


What's next

Module 4 quiz