Word Embeddings
& Neural Representations
Static embeddings (Word2Vec, GloVe, FastText) and contextual embeddings (BERT). From dense vector spaces and vector arithmetic to bidirectional context and pre-training. Benchmarked on 50,000 IMDb reviews. Based on ENSAM 2025/2026 lecture PDFs.
Classical methods (BoW, TF-IDF) represent each word as a unique dimension in a sparse vocabulary-sized vector. No two words are ever "close" to each other — "film" and "movie" are as distant as "film" and "elephant".
The key insight of neural word embeddings:
"A word is characterized by the company it keeps."
"film" and "movie" appear in nearly the same contexts → they should have similar vectors.
| Property | Classical (TF-IDF) | Static Embeddings (Word2Vec) |
|---|---|---|
| Vector size | Vocabulary size (50,000+) | 100–300 dimensions |
| Sparsity | 99%+ zeros | Dense — all dimensions meaningful |
| Synonyms | "great" ≠ "excellent" (different dims) | "great" ≈ "excellent" (cosine ~ 0.85) |
| Semantics | No semantic meaning | Semantic clusters in vector space |
| Training required | No (counting) | Yes (neural network on large corpus) |
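To make the contrast concrete, here is a minimal scikit-learn sketch (scikit-learn assumed installed) showing that two documents sharing no terms are exactly orthogonal in TF-IDF space, no matter how synonymous they are:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two near-synonymous documents with zero lexical overlap
X = TfidfVectorizer().fit_transform(["a great film", "an excellent movie"])

# No shared terms -> orthogonal vectors -> cosine similarity of exactly 0
print(cosine_similarity(X[0], X[1]))  # [[0.]]
```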
| Era | Methods | Key Innovation |
|---|---|---|
| 1990s–2000s | BoW, TF-IDF | Frequency counting — fast, interpretable |
| 2013 | Word2Vec (Google) | Dense vectors, semantic similarity, vector arithmetic |
| 2014 | GloVe (Stanford) | Global co-occurrence statistics |
| 2016 | FastText (Facebook) | Subword decomposition — handles OOV |
| 2018 | ELMo (AllenAI), BERT (Google) | Contextual representations — different vector per occurrence |
Word2Vec (Mikolov et al., Google, 2013) trains a shallow neural network to predict words from their context. The actual task (prediction) is discarded — the learned weight matrix becomes the embedding lookup table.
Skip-gram: given the center word, predict the surrounding context words. Better for rare words. Works well on smaller datasets.
CBOW (Continuous Bag of Words): given the context words, predict the center word. Faster to train. Better for frequent words and larger datasets.
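Both architectures are available in gensim; a minimal training sketch (gensim ≥ 4 assumed; the four-sentence corpus is a toy stand-in for the large corpus that real training requires):

```python
from gensim.models import Word2Vec

# Toy corpus: "film" and "movie" occur in identical contexts
sentences = [
    ["the", "film", "was", "brilliant"],
    ["the", "movie", "was", "brilliant"],
    ["the", "film", "was", "boring"],
    ["the", "movie", "was", "boring"],
]

# sg=1 selects Skip-gram; sg=0 selects CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Identical contexts should push the two vectors together
print(model.wv.similarity("film", "movie"))
```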
| Use Case | Example |
|---|---|
| Film Recommendations | "Inception" → vector → find similar films in embedding space |
| Semantic Search | Query "car" finds results mentioning "auto", "automobile" |
| Chatbots | "problem" cluster: "bug", "error", "issue" — all mapped nearby |
| Entity disambiguation | Names of similar people/places cluster together |
• Captures semantic similarity and synonyms
• Vector analogies (king − man + woman ≈ queen; see the sketch after these lists)
• Dense vectors: 100–300 dimensions
• Pre-trained models available (Google News 300d)
• One vector per word → "bank" (financial) = "bank" (river) — polysemy ignored
• OOV: unknown words = no representation
• Local context only (fixed window)
• Needs billions of words for quality vectors
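The analogy arithmetic above can be reproduced with the pre-trained Google News vectors via gensim's downloader (network access assumed; the vectors are roughly 1.6 GB on first download):

```python
import gensim.downloader as api

# Pre-trained Word2Vec vectors (300d, trained on Google News)
wv = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected top hit: ('queen', ~0.71)
```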
GloVe (Pennington et al., Stanford, 2014) factorizes a global co-occurrence matrix, fitting word vectors by weighted least squares so that their dot products approximate log co-occurrence counts (it does not use SVD). Instead of predicting local context (like Word2Vec), GloVe leverages overall corpus statistics — how often every word pair co-occurs across the entire corpus.
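Concretely, the objective from the GloVe paper is a weighted least-squares fit of vector dot products to log co-occurrence counts:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

Here X_ij counts how often word j appears in the context of word i, and f is a weighting function that down-weights very frequent pairs.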
- Analogy champion: built to excel at word-analogy tasks, and competitive on similarity benchmarks such as WordSim-353 and SimLex-999.
- Limitation: the V×V co-occurrence matrix becomes prohibitively large in RAM for big vocabularies.
FastText extends Word2Vec by breaking words into character n-grams (subwords). The embedding of a word is the sum of its subword embeddings. This solves the OOV (Out-of-Vocabulary) problem — any word, even one unseen during training, can be represented using its subword components.
| Domain | Why FastText |
|---|---|
| Social media ("amazingg", "luv", "gr8") | Constant OOV from slang, abbreviations, typos |
| Medical NLP ("hypertension", "cardiomyopathy") | Rare medical terms → subword decomposition guarantees a vector |
| Morphologically rich languages (Arabic, Turkish, Finnish) | Complex suffixes handled via subwords — FastText is the de facto standard for these languages |
| Noisy OCR / SMS / forums | "necesary" → subword overlap with "necessary" → correct cluster |
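The OOV behaviour can be sketched with gensim's FastText implementation (gensim ≥ 4 assumed; toy corpus for illustration):

```python
from gensim.models import FastText

sentences = [
    ["this", "film", "was", "amazing"],
    ["that", "movie", "was", "amazing", "too"],
]

# min_n/max_n set the character n-gram range used for subwords
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=6)

# "amazingg" was never seen in training...
print("amazingg" in model.wv.key_to_index)          # False
# ...but its character n-grams overlap with "amazing", so it still gets a vector
print(model.wv.similarity("amazing", "amazingg"))
```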
| Criteria | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Handles OOV words | No | No | Yes |
| Analogy performance | Good | Excellent | Good |
| Training speed | Fast | Slower | Fast |
| Morphological richness | No | No | Yes |
| Noisy text robustness | Poor | Poor | Strong |
| Best use case | Semantic similarity, clean text | Benchmarks, analogies | Social media, medical, OOV |
"I deposited money at the bank." → BANK = Financial Institution
"He's fishing on the bank's shore." → BANK = Riverbank
Word2Vec · GloVe · FastText → "bank" has the SAME vector in both sentences. This is fundamentally wrong for any task requiring understanding of meaning in context.
Polysemy is the phenomenon where a single word has multiple meanings. All static embeddings collapse all senses into one average vector. The solution requires reading the entire sentence and computing a different vector for each occurrence of the word — a vector that reflects the specific meaning in that context.
This is exactly what contextual embeddings (ELMo, BERT, GPT) do. The transformer attention mechanism allows every word to "look at" every other word in the sentence before its representation is computed.
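This can be checked directly with Hugging Face transformers (transformers and torch assumed installed; vector_of is a helper defined here purely for illustration):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(word, sentence):
    """Contextual embedding of `word` within `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    pos = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[pos]

v_money = vector_of("bank", "he deposited cash at the bank")
v_river = vector_of("bank", "he sat on the bank of the river")

# Same word, different senses -> noticeably different vectors
print(torch.cosine_similarity(v_money, v_river, dim=0).item())
```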
BERT (Devlin et al., Google, 2018) — based on the Transformer encoder architecture. Instead of a fixed vector per word, each word receives a different vector depending on its full bidirectional context: self-attention lets every token attend to the tokens on its left AND its right at once (not sequentially like RNNs).
"The [MASK] was brilliant" → predicts "film". 15% of tokens are masked randomly. Forces the model to use bidirectional context to fill in blanks — cannot cheat by looking only left or right.
Next Sentence Prediction (NSP): given sentences A and B, predict whether B logically follows A. This teaches the model inter-sentence relationships — useful for QA, inference, and summarisation tasks.
| Specification | BERT-Base | BERT-Large |
|---|---|---|
| Transformer encoder layers | 12 | 24 |
| Attention heads | 12 | 16 |
| Hidden dimension | 768 | 1024 |
| Total parameters | ~110M | ~340M |
| Pre-training data | Wikipedia + BooksCorpus (3.3 billion words) | same |
| Tokenisation | WordPiece — handles OOV via subword splitting | same |
• Bidirectional context: "not good" (negative) vs "not bad" (positive) — BERT distinguishes them; W2V cannot
• Polysemy resolved: "bank" financial ≠ "bank" geographical → different vectors
• Massive pre-training: 3.3B words → rich linguistic knowledge via transfer learning
• Handles irony, negation, nuance naturally
• 140× slower than FastText — impractical for real-time applications without a GPU
• GPU almost mandatory (fine-tuning: 30–120 min; production: GPU inference)
• Large labeled dataset needed for fine-tuning
• Black box: attention patterns are hard to interpret
| Domain | Why BERT Is Essential |
|---|---|
| Chatbot & Question Answering | Nuanced understanding of interconnected, multi-sentence context |
| Legal analysis | Critical semantic nuances in contracts and clauses |
| Medical diagnosis NLP | Irony, negation in clinical notes ("patient shows no signs of...") |
| Hate speech / sentiment detection | Sarcasm and irony impossible without bidirectional context |
| Automatic translation | BERT Multilingual: 104 languages in one model |
All methods tested on the same 50,000 IMDb reviews (25k train / 25k test), sentiment classification task.
| Method | Type | Accuracy | F1-macro | Latency | GPU? | Interpretability | OOV |
|---|---|---|---|---|---|---|---|
| BoW | Classical | 86.2% | 0.86 | ~1 ms | No | Excellent | No |
| N-gram (1,2) | Classical | ~89% | ~0.89 | ~3 ms | No | Very good | No |
| TF-IDF | Classical | 90.1% | 0.90 | ~3 ms | No | Excellent | No |
| Word2Vec | Static | 85.7% | 0.86 | ~12 ms | No | Weak | No |
| GloVe | Static | 76.8% | 0.77 | ~10 ms | No | Weak | No |
| FastText | Static | 86.0% | 0.86 | ~15 ms | No | Weak | Yes |
| BERT fine-tuned | Contextual | 93.9% | 0.94 | ~370 ms (CPU) | Yes | Very weak | Yes |
| Review | True | BoW | TF-IDF | W2V | FastText | BERT |
|---|---|---|---|---|---|---|
| "This film was absolutely brilliant" | Pos | ✓ | ✓ | ✓ | ✓ | ✓ |
| "This film was terrible and boring" | Neg | ✓ | ✓ | ✓ | ✓ | ✓ |
| "This movie was not good at all" | Neg | ✗ | ✓ | ✗ | ✗ | ✓ |
| "Not bad, quite enjoyable" | Pos | ✓ | ✓ | ✗ | ✗ | ✓ |
| "I expected something better" | Neg | ✓ | ✓ | ✓ | ✓ | ✓ |
| "A film that was surprisingly good" | Pos | ✓ | ✓ | ✓ | ✓ | ✓ |
- TF-IDF (with bigrams): 6/6 correct — bigram "not good" = strong negative signal. Best classical method.
- BERT: 6/6 correct — bidirectional context resolves "not good" (negative) vs "not bad" (positive).
- BoW: 5/6 — fails on "not good" (treats "not" and "good" separately).
- W2V / FastText: 4/6 — fail on negation ("not good", "not bad") because mean pooling loses word order.
Step 1: TF-IDF (fast baseline, no GPU, fully interpretable)
Step 2: FastText (if OOV words are a problem — noisy text, medical, social media)
Step 3: BERT fine-tuned (if polysemy, negation, or nuance is critical + GPU available)
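Step 1 takes only a few lines of scikit-learn; a minimal sketch, reusing the six example reviews from the table above as toy data (a real baseline would fit on the 25k-review IMDb training split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "This film was absolutely brilliant",
    "This film was terrible and boring",
    "This movie was not good at all",
    "Not bad, quite enjoyable",
    "I expected something better",
    "A film that was surprisingly good",
]
labels = [1, 0, 0, 1, 0, 1]  # 1 = positive, 0 = negative

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # bigrams capture "not good" / "not bad"
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["not good at all", "not bad at all"]))
```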
| Task Requirement | Best Choice | Reason |
|---|---|---|
| Need context / polysemy / negation | BERT fine-tuned | Only method that gives a different vector per occurrence |
| Noisy data / OOV words (typos, slang) | FastText | Subword decomposition handles any word — never gets OOV |
| Need similarity / analogies | Word2Vec or GloVe | Semantic vector space captures word relationships |
| Fast, interpretable, no GPU, no training data | TF-IDF | 140× faster than BERT, fully explainable, CPU-only |
Each family solves the fundamental limitations of the previous one. Increasing complexity → better meaning capture → more resources required.
| Family | Methods | Representation | Semantics | Context | OOV | Key Strength |
|---|---|---|---|---|---|---|
| Classical | BoW · N-grams · TF-IDF | Frequency vector / weight — each word = one dimension | None — "film" and "movie" are unrelated | None — word order ignored | Unknown words dropped entirely | Speed · Interpretability · No GPU · No training |
| Static Embeddings | Word2Vec · GloVe · FastText | Dense vector 100–300d learned from co-occurrence | Semantic similarity captured ("film" ≈ "movie", cosine 0.89) | One vector per word — "bank" = same in all sentences | FastText ✅ · W2V/GloVe ❌ | Semantic similarity · Vector analogies · Compact representation |
| Contextual | ELMo · BERT · GPT | Dynamic vector 768–1024d — recomputed per context | Deep semantic meaning: nuances, irony, polysemy | Bidirectional — "bank" gets different vectors in different sentences | WordPiece subwords ✅ | Negation · Irony · Polysemy · Bidirectional context |