MOD 05 Classical Text Representation
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Classical Text Representation

Bag of Words, N-grams, and TF-IDF — the frequency-based family of text representation methods. Results are benchmarked on 50,000 IMDb reviews. These methods require no GPU and no neural training, making them the essential fast baseline before reaching for neural embeddings.

Module 05 of 07
Dataset: IMDb 50k reviews (25k train / 25k test)
Best result: TF-IDF 1+2-gram, 90.13%
Part A
Classical Methods — Bag of Words · N-grams · TF-IDF
01 · Bag of Words (BoW)

The simplest text representation. Each document becomes a counting vector: how many times does each vocabulary word appear? Word order is completely ignored — the document is treated as a "bag" with no sequence.

Principle

Build a global vocabulary from all documents. Each document is then a vector of length = |vocabulary|, where each entry is the word count in that document.

Review | film | brilliant | terrible | not | good
"This film was brilliant" | 1 | 1 | 0 | 0 | 0
"This film was not good" | 1 | 0 | 0 | 1 | 1
"Terrible film, not good" | 1 | 0 | 1 | 1 | 1
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=20000, min_df=2)
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# IMDb 50k result: 86.18% accuracy with Logistic Regression
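The snippet above only builds the features; the 86.18% figure comes from a Logistic Regression trained on them. A minimal sketch of that classifier step, assuming label arrays y_train and y_test (variable names assumed here) alongside the IMDb text splits:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X_train / X_test come from the CountVectorizer snippet above.
# y_train / y_test are the 0/1 sentiment labels of the IMDb split (assumed variable names).
clf = LogisticRegression(max_iter=1000)  # extra iterations help on high-dimensional sparse input
clf.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))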
IMDb Predictions — Where BoW Fails
Review | True Label | BoW Predicts | Analysis
"This film was absolutely brilliant and I loved it" | Positive | ✓ Positive | Keywords 'brilliant', 'loved' → strong positive signal
"Terrible film, a complete waste of time and money" | Negative | ✓ Negative | Keywords 'terrible', 'waste' → strong negative signal
"The film was not good at all" | Negative | ✗ Positive | 'not', 'good' seen separately — ambiguous signal
"Not bad, actually quite enjoyable" | Positive | ✓ Positive | 'enjoyable' compensates — correct by chance!
Advantages

• Quick to implement and use
• Easy to interpret — you can see exactly which words drive the score (see the coefficient sketch after this list)
• No training required
• Matrix 99% sparse → fast with any standard ML classifier
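A minimal interpretability sketch, assuming the vectorizer and the Logistic Regression clf fitted in the snippets above: the per-word coefficients show exactly which vocabulary entries push a review towards positive or negative.

import numpy as np

# Assumes 'vectorizer' and 'clf' from the BoW snippets above are already fitted.
feature_names = vectorizer.get_feature_names_out()
coefs = clf.coef_[0]  # one weight per vocabulary word

top_pos = np.argsort(coefs)[-10:][::-1]  # words pushing towards the positive class
top_neg = np.argsort(coefs)[:10]         # words pushing towards the negative class
print("Most positive words:", feature_names[top_pos])
print("Most negative words:", feature_names[top_neg])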

Disadvantages

• Negation invisible: "not good" ≈ "good" for BoW
• Synonyms unlinked: "great" and "excellent" are separate features
• Order lost: "John beats Paul" = "Paul beats John"
• Very high-dimensional, 99% sparse matrix → memory-intensive at large vocabulary sizes

Real-World Use Cases Where BoW Is Optimal
Use Case | Example | Why BoW Works
Anti-spam filter | Gmail, Outlook | "lottery", "click here", "win" → spam. Keywords are enough.
Article categorisation | BBC News, Reuters | "football", "goal" → Sport. Domain-specific vocabulary.
Document search | Elasticsearch, Solr | Query = doc, TF-IDF similarity. Fast on millions of docs.
02 · N-grams — Capturing Sequences

An N-gram is a sequence of N consecutive words. By including multi-word sequences (bigrams, trigrams), we capture some local context that pure BoW misses — especially negation patterns.

Phrase: "The film was not good" Unigrams (1): the | film | was | not | good Bigrams (2): "the film" | "film was" | "was not" | "not good" ← captures negation! Trigrams (3): "the film was" | "film was not" | "was not good" Key insight: bigram "not good" = strong NEGATIVE signal unigrams treat "not" and "good" separately → ambiguous
N-gram Type | IMDb Accuracy (50k) | Notes
Unigram (1) | 86.18% | Same as basic BoW
Bigram (2) | ~88–89% | Captures 2-word phrases and simple negation
Trigram (3) | 90.11% | Best for sentence-level patterns
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range=(1, 3), max_features=50000)
X = vect.fit_transform(corpus)

# ngram_range=(1, 3) includes unigrams, bigrams, and trigrams
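One way the accuracy comparison in the table above could be reproduced is a simple sweep over ngram_range, assuming the IMDb splits X_train_txt, X_test_txt and label arrays y_train, y_test are loaded (the exact vectorizer settings behind the reported numbers may differ):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

for n in (1, 2, 3):
    vect = TfidfVectorizer(ngram_range=(1, n), max_features=50000)
    Xtr = vect.fit_transform(X_train_txt)   # fit vocabulary on training texts only
    Xte = vect.transform(X_test_txt)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
    acc = accuracy_score(y_test, clf.predict(Xte))
    print(f"ngram_range=(1, {n}): accuracy = {acc:.4f}")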
Advantages

• Fast and interpretable
• Handles negation: "not good", "really bad" captured as a unit
• Easy to implement on top of BoW pipeline

Disadvantages

• Vocabulary explosion: V² to V³ dimensions — memory-intensive
• Still blind to synonyms and polysemy
• Long-range dependencies still missed

03 · TF-IDF — Weighting Words by Their Rarity

TF-IDF (Term Frequency – Inverse Document Frequency) improves BoW by weighting words: a word is important if it is frequent in this document but rare in the rest of the collection. This naturally downweights stop words and upweights distinctive vocabulary.

TF-IDF Formula

$$\text{TF}(t,d) = \frac{\text{count}(t,d)}{|d|} \qquad \text{IDF}(t) = \log\!\left(\frac{N}{df(t)}\right) + 1$$

$$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$$

$N$: total documents · $df(t)$: documents containing term $t$ · High IDF = rare = informative
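A tiny worked example of the formula above on a toy three-document corpus (natural log; note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalisation, so its values differ slightly):

import math

docs = [
    "the film was brilliant".split(),
    "the film was not good".split(),
    "terrible film not good".split(),
]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)          # TF(t, d) = count(t, d) / |d|

def idf(term):
    df = sum(term in doc for doc in docs)      # number of documents containing the term
    return math.log(N / df) + 1                # IDF(t) = log(N / df(t)) + 1

for term in ("the", "brilliant"):
    print(term, round(tf(term, docs[0]) * idf(term), 3))
# 'the' appears in 2 of 3 documents -> low IDF; 'brilliant' in only 1 of 3 -> higher IDF,
# so 'brilliant' ends up weighted more heavily in the first document.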

Real IDF Values on 50k IMDb
Word Type | Example | Real IDF (50k corpus) | Role
Stop word | 'is', 'it', 'in' | ~1.1 | Nearly ignored — noise
Neutral word | 'film', 'movie' | ~3–4 | Low weight — too common across ALL reviews
Discriminatory (+) | 'brilliant', 'masterpiece' | High IDF | Strong positive signal
Discriminatory (−) | 'terrible', 'awful' | High IDF | Strong negative signal
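These values can be read directly off a fitted vectorizer via its idf_ attribute. A sketch, assuming corpus holds the review texts (as in the snippet below):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vect = TfidfVectorizer()
vect.fit(corpus)  # 'corpus' = the list of review texts (assumed loaded)

vocab = vect.get_feature_names_out()
idf = vect.idf_
for word in ("is", "film", "brilliant", "terrible"):
    if word in vect.vocabulary_:
        print(word, round(idf[vect.vocabulary_[word]], 2))

print("Lowest-IDF words:", vocab[np.argsort(idf)[:10]])   # typically stop words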
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), sublinear_tf=True)
X = vect.fit_transform(corpus)

# IMDb results (50k, 25k/25k split):
#   BoW unigram:      86.18%
#   TF-IDF unigram:   88.89%  (+2.71%)
#   TF-IDF 1+2-gram:  90.13%  ← best classical method
#   5-fold CV:        0.9026 ± 0.0044 (stable)
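The 5-fold figure can be obtained by wrapping the vectorizer and classifier in a single Pipeline, so the vocabulary is re-fit inside each fold. A sketch, assuming corpus (texts) and a label array labels (name assumed):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipe, corpus, labels, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.4f} ± {scores.std():.4f}")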
Interpretation — Why 90.13%?
  • IDF reduces stop-word noise ("is", "the" get near-zero weight automatically)
  • TF-IDF with bigrams adds nearly 4 points over plain BoW (86.18% → 90.13%) by capturing "not good", "absolutely brilliant"
  • Distinctive words like 'brilliant', 'masterpiece', 'terrible', 'awful' receive high weight
  • 5-fold cross-validation confirms stability (0.9026 ± 0.0044)
Advantages

• 140× faster than BERT — processes 1M documents/hour on a CPU
• Total interpretability: see exactly which word drives the outcome
• Stop words automatically downweighted via IDF
• No GPU, no neural training required

Limitations

• Synonyms ignored: "great" and "excellent" are two separate features
• Word order lost (same as BoW)
• Polysemy blind: "bank" (financial) and "bank" (river) get the same weight
• No semantic understanding at all

Production Use Cases for TF-IDF
Domain | Example | Why TF-IDF Works
Document search | Elasticsearch · Wikipedia · internal search | Query-document similarity in sparse vector space
Press classification | BBC News → 96% accuracy with TF-IDF alone | Domain vocabulary is stable and distinctive
Legal analysis | Contract review, clause identification | Precise, standardised vocabulary
Medical records | ICD codes, diagnostic classification | Rare medical terms have very high IDF → informative
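For the document-search row, the core mechanism is cosine similarity between TF-IDF vectors. A minimal sketch with a toy index:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = [
    "the central bank raised interest rates",
    "the river bank flooded after the storm",
    "a brilliant film with a terrible ending",
]
vect = TfidfVectorizer()
doc_vecs = vect.fit_transform(docs)              # rows are L2-normalised by default

query_vec = vect.transform(["interest rates at the bank"])
scores = linear_kernel(query_vec, doc_vecs)[0]   # dot product = cosine similarity here
best = scores.argmax()
print(docs[best], scores[best])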
04 · Shared Limitations of All Classical Methods

BoW, N-grams, and TF-IDF all belong to the frequency-based family. They share the same fundamental ceiling:

Limitation | What It Means | Example
No semantic meaning | "film" and "movie" are completely unrelated features | Synonym vocabulary explosion without semantic clustering
No context | Each word is independent — word order ignored | "John beats Paul" = "Paul beats John" in all classical methods
Polysemy ignored | "bank" gets the same representation in all contexts | Financial bank = river bank → no disambiguation
OOV (Out of Vocabulary) | Unknown words are ignored entirely | Typos, slang, neologisms invisible to the model
Sparse representation | Vocabulary can be 50k+ dimensions; each document is mostly zeros | Memory-intensive; distances in high-dimensional sparse space are noisy
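The OOV row is easy to demonstrate: any word absent from the fitted vocabulary simply disappears from the representation. A tiny sketch:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit(["the film was brilliant", "the film was terrible"])

x = vect.transform(["the film was amazinggg"])   # typo/neologism unseen at fit time
print(x.toarray())
# Only 'the', 'film', 'was' are counted; 'amazinggg' contributes nothing to the vector.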
Why these limitations matter: "The film was not good at all" — classical methods without bigrams predict Positive because "good" appears and the negation "not" is treated as a separate, low-signal token. Moving to neural embeddings (Module 06) addresses synonymy and semantic similarity; moving to BERT addresses context, polysemy, negation, and nuance.
05 · When to Stop at Classical Methods

Despite their limitations, classical methods remain the correct first choice in many real-world scenarios:

Use Classical When

• No GPU available (CPU-only inference)
• Speed is critical (real-time, high-volume pipelines)
• Interpretability is required (legal, medical, compliance)
• Domain vocabulary is stable and distinctive
• Data is domain-specific with little polysemy

Move to Neural When

• Context and meaning matter (negation, irony, polysemy)
• OOV words appear frequently (social media, medical)
• Semantic similarity is needed beyond exact keywords
• GPU is available and inference latency is acceptable
• 90%+ is not good enough for your task

Decision Ladder (from Module 06)

Step 1: TF-IDF with bigrams — fast baseline, no GPU
Step 2: FastText — if OOV/noisy text is a problem
Step 3: BERT fine-tuned — if polysemy, negation, or nuance is critical