Course Summaries — 7 Modules
Exam-Ready Summaries

Condensed version of all 7 modules. Every formula, benchmark, and concept needed to pass the Q&A and the exam — in one place.

Modules: 7
Topics: ANN · CNN · RNN/LSTM · NLP · Text Rep · Embeddings · Transformers
Exam: May 18, 2026

Module 01: Deep Learning Essentials & ANN
Perceptron, Activation, Backprop, Optimizers, Regularization

1. Deep Learning vs Machine Learning

DL = multi-layered neural networks that learn features automatically from data. ML requires manual feature engineering. DL needs large data + GPU; ML works with less. DL is less interpretable (black box) but state-of-the-art on complex tasks.

2. Perceptron & Activation Functions

A perceptron computes \(z = W\cdot X + b\), then applies activation: \(y = f(z)\). Without activation, stacking layers = one linear transform.

| Function   | Formula                       | Range          | Use                                                    |
|------------|-------------------------------|----------------|--------------------------------------------------------|
| Sigmoid    | \(1/(1+e^{-x})\)              | (0, 1)         | Binary output; vanishing gradient for large \|x\|      |
| Tanh       | \((e^x-e^{-x})/(e^x+e^{-x})\) | (−1, 1)        | Zero-centered; also saturates                          |
| ReLU       | \(\max(0,x)\)                 | [0, ∞)         | Default for hidden layers; fast, no saturation for x>0 |
| Leaky ReLU | \(\max(0.01x,x)\)             | (−∞, ∞)        | Fixes dying ReLU (always-zero neuron)                  |
| Softmax    | \(e^{x_i}/\sum e^{x_j}\)      | (0, 1), sum=1  | Multi-class output                                     |

3. Loss Functions

MSE (regression): \(\frac{1}{n}\sum (y_i - \hat{y}_i)^2\) — penalizes large errors heavily.
Binary Cross-Entropy: \(-\frac{1}{n}\sum[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]\) — classification gold standard.
Categorical Cross-Entropy: used with Softmax for multi-class.

4. Gradient Descent & Backpropagation

Update rule: \(W = W - \alpha\frac{\partial L}{\partial W}\) where α = learning rate. Backprop computes \(\frac{\partial L}{\partial W}\) via the chain rule: \(\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W}\).

| Variant    | Data per update | Tradeoff                |
|------------|-----------------|-------------------------|
| Batch GD   | ALL data        | Stable but slow         |
| SGD        | 1 sample        | Fast but noisy          |
| Mini-batch | 32–256          | Best balance (default)  |
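
A minimal PyTorch sketch of one mini-batch update (toy model and random batch; names are illustrative): zero the gradients, forward, backward, step.

```python
# Minimal sketch of one mini-batch SGD step in PyTorch; model and data are stand-ins.
import torch
import torch.nn as nn

model = nn.Linear(20, 2)                                  # toy model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 20)                                   # one mini-batch of 64 samples
y = torch.randint(0, 2, (64,))

optimizer.zero_grad()                                     # clear previous gradients
loss = criterion(model(x), y)                             # forward pass
loss.backward()                                           # backprop: compute dL/dW via chain rule
optimizer.step()                                          # W = W - lr * dL/dW
```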

5. Optimizers

| Optimizer | Key idea                                                                    |
|-----------|-----------------------------------------------------------------------------|
| SGD       | Plain update \(W = W - \alpha\nabla L\)                                     |
| Momentum  | Accumulates velocity \(v = \beta v - \alpha\nabla L\), then updates \(W = W + v\) |
| RMSprop   | Adaptive LR per parameter: divide by √(moving avg of squared grads)         |
| Adam      | Momentum + RMSprop combined — best general-purpose. β₁=0.9, β₂=0.999        |

6. Regularization

  • L2 (Weight Decay): adds \(\lambda\|W\|^2\) to loss — penalizes large weights
  • L1: adds \(\lambda\|W\|_1\) — promotes sparsity (zero weights)
  • Dropout: randomly zeros p% of neurons during training. At inference, all active but scaled by (1−p). Typical p=0.2–0.5
  • BatchNorm: \(\hat{x} = \frac{x-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}}\), then \(y = \gamma\hat{x} + \beta\) — stabilizes, speeds up, regularizes
  • Early Stopping: stop when val_loss stops decreasing

7. ANN Architecture & Parameter Count

A fully-connected block: Linear → BatchNorm → ReLU → Dropout. Parameters per layer = \(in \times out + out\) (weights + biases). Example: 784→256→128→10 = 200,960 + 32,896 + 1,290 = ~235K params.
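
A quick sketch using the sizes from the example above (the Dropout probability is illustrative) that builds the block and checks the count:

```python
# Sketch verifying the 784→256→128→10 count (in*out + out per Linear layer).
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 10),
)
linear_params = sum(p.numel()
                    for m in model if isinstance(m, nn.Linear)
                    for p in m.parameters())
print(linear_params)   # 200960 + 32896 + 1290 = 235146
```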

Weight initialization: Xavier for Sigmoid/Tanh (\(\sqrt{2/(in+out)}\)), He for ReLU (\(\sqrt{2/in}\)). Too small → vanishing; too large → exploding.

8. Evaluation Metrics

\(\text{Accuracy} = \frac{TP+TN}{Total}\) — use when classes are balanced.
\(\text{Precision} = \frac{TP}{TP+FP}\) — when FP is costly (spam).
\(\text{Recall} = \frac{TP}{TP+FN}\) — when FN is costly (disease).
\(\text{F1} = 2\frac{P\cdot R}{P+R}\) — balance when classes are imbalanced.

9. Training Diagnostics

| Pattern      | Diagnosis    | Action                                     |
|--------------|--------------|--------------------------------------------|
| train↓ val↓  | Good         | Continue                                   |
| train↓ val↑  | Overfitting  | More dropout, reduce capacity, early stop  |
| Both high    | Underfitting | Increase capacity, train longer            |
| Large gap    | Overfitting  | More regularization                        |
Key exam insight
Dropout ON during training, OFF at inference (weights scaled). BatchNorm behaves differently: uses batch stats during training, running averages at inference.

Module 02: Convolutional Neural Networks (CNN)
Conv2d, Pooling, Transfer Learning, Medical Imaging

1. Why CNN? (vs ANN on images)

A 224×224×3 image has ~150K input values (50K pixels × 3 channels). A single ANN layer of 1024 units on top of it needs ~154M parameters — completely impractical. CNNs solve this with: local connectivity (each neuron sees a patch), weight sharing (same filter slides across image), and hierarchical learning (edges → textures → objects).

2. Conv2d Output Size

\(W_{out} = \lfloor\frac{W_{in} - K + 2P}{S}\rfloor + 1\)

Examples: 28×28, K=3, P=1, S=1 → 28 (same). 28×28, K=3, P=0, S=1 → 26 (shrinks). 28×28, K=2, S=2 → 14 (halved).

3. Conv2d Parameters

\(\text{Params} = (K \times K \times C_{in} + 1) \times C_{out}\)

Conv2d(3→32, K=3): (3×3×3+1)×32 = 896 parameters — orders of magnitude fewer than ANN.
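
A small sketch verifying both the parameter count and the output-size formula in PyTorch:

```python
# Sketch checking Conv2d(3→32, K=3) parameters and the output size with padding=1.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # (3*3*3 + 1) * 32 = 896

x = torch.randn(1, 3, 28, 28)
print(conv(x).shape)   # torch.Size([1, 32, 28, 28]): padding=1 keeps 28x28
```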

4. Padding & Pooling

  • padding=0 (valid): no padding, spatial size shrinks each layer
  • padding=1 (same, K=3): 1 border of zeros, output = input size
  • MaxPool2d(2,2): takes max in 2×2 window, stride 2 → halves dimensions
  • AdaptiveAvgPool2d((1,1)): squeezes any spatial map to 1×1 (global pooling)

5. Standard CNN Block

Conv2d → BatchNorm2d → ReLU → MaxPool2d(2) → Dropout(p)

Conv extracts features, BN normalizes, ReLU adds non-linearity, MaxPool downsamples, Dropout regularizes.
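
A minimal PyTorch sketch of this block (channel counts and dropout probability are illustrative):

```python
# Minimal sketch of the standard conv block.
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),  # extract features, keep spatial size
    nn.BatchNorm2d(32),                          # normalize activations
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),                             # halve spatial dimensions
    nn.Dropout(0.25),                            # regularize
)
```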

6. Architecture Patterns (from lab)

| Model    | Domain            | Structure            | Final dims                       |
|----------|-------------------|----------------------|----------------------------------|
| DigitCNN | MNIST (28×28×1)   | 2 conv blocks + 2 FC | 1→32→64 → 3136 → 128 → 10        |
| ChestCNN | X-ray (224×224×3) | 4 conv blocks + 2 FC | 3→32→64→128→256 → 256 → 128 → 2  |

Why Dropout 0.25 in conv blocks, 0.5 in classifier? Dense layers have far more parameters → higher overfitting risk → need stronger regularization.

7. Transfer Learning — 5 Architectures

| Model        | Year | Params | Innovation                                 | Head          |
|--------------|------|--------|--------------------------------------------|---------------|
| VGG16        | 2014 | 138M   | Uniform 3×3 convs, deep                    | classifier[6] |
| GoogLeNet    | 2014 | 6.8M   | Inception (multi-scale in parallel)        | fc            |
| ResNet50     | 2015 | 25M    | Skip connections (F(x)+x)                  | fc            |
| MobileNetV2  | 2018 | 3.4M   | Depthwise separable conv (8× fewer ops)    | classifier[1] |
| EfficientNet | 2019 | 5.3M   | Compound scaling (depth+width+resolution)  | classifier[1] |
Two transfer learning strategies
Feature extraction: freeze backbone, train only head. Use for small datasets (<5k), similar domain. LR ~ 1e-4.
Fine-tuning: unfreeze all/some layers. Use for larger datasets, different domain. LR ~ 1e-5 (pre-trained weights are fragile).
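
A minimal sketch of the feature-extraction strategy with torchvision's ResNet50, assuming the torchvision ≥ 0.13 weights API; the 2-class head mirrors the chest X-ray setup above:

```python
# Feature extraction sketch: frozen backbone, new trainable head, LR ~ 1e-4.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")    # pre-trained backbone
for p in model.parameters():
    p.requires_grad = False                         # freeze everything
model.fc = nn.Linear(model.fc.in_features, 2)       # new head (trainable by default)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
```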

8. Data Augmentation

Applied to training set only. Never val/test. Standard: RandomHorizontalFlip, RandomRotation(10°). Never use vertical flip for medical (anatomically wrong). Normalize with ImageNet stats: mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225].
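
A sketch of the two transform pipelines (the Resize step is an assumption for 224×224 backbones):

```python
# Training-only augmentation plus ImageNet normalization; val/test get resize + normalize only.
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
eval_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```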

9. Class Imbalance

Two solutions: WeightedRandomSampler (each batch ~balanced) or weighted CrossEntropyLoss (penalize minority mistakes more). For medical: prioritize Recall (Sensitivity) — missing a pneumonia case is fatal.
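
A sketch of both options on illustrative class counts (the `labels` tensor is a placeholder for the real training labels):

```python
# Sketch of weighted loss vs weighted sampling (0 = NORMAL, 1 = PNEUMONIA; counts illustrative).
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0] * 300 + [1] * 900)          # imbalanced training labels
counts = torch.bincount(labels).float()

# Option A: weighted loss, minority-class mistakes cost more
criterion = nn.CrossEntropyLoss(weight=counts.sum() / (2 * counts))

# Option B: weighted sampler, each batch drawn roughly balanced
sample_weights = (1.0 / counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# then: DataLoader(dataset, batch_size=32, sampler=sampler)
```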

The 16-sample validation trap
Tiny val sets give unreliable accuracy — ±1 sample = ±6.25% swing. Use test set as primary estimate, validation only for early stopping signal.

Module 03: RNN, LSTM & GRU
Recurrence, Gating, BPTT, Time Series, Text Generation

1. Why Recurrent Networks?

ANNs/CNNs assume independent inputs. Sequences (text, time series, speech) have temporal dependencies — "not good" ≠ "good not". RNNs maintain a hidden state that carries information across time steps.

2. Vanilla RNN

\(h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h)\)
\(y_t = W_{hy}h_t + b_y\)

W_hh and W_xh are shared across all time steps — same number of parameters regardless of sequence length.

3. The Vanishing Gradient Problem

In Backpropagation Through Time (BPTT), gradients flow backward through T time steps and are multiplied by W_hh at each step. If the largest eigenvalue of W_hh is < 1 → gradients vanish (early time steps have no influence); if > 1 → gradients explode. Consequence: vanilla RNNs cannot learn long-range dependencies (>20–30 steps).

4. LSTM — The Solution

LSTM adds a cell state c_t (long-term memory) and 3 gates that control information flow:

| Gate      | Formula                                   | Role                        |
|-----------|-------------------------------------------|-----------------------------|
| Forget    | \(f_t = \sigma(W_f[h_{t-1},x_t] + b_f)\)  | What to erase from c_{t-1}  |
| Input     | \(i_t = \sigma(W_i[h_{t-1},x_t] + b_i)\)  | What new info to store      |
| Candidate | \(g_t = \tanh(W_g[h_{t-1},x_t] + b_g)\)   | New candidate values        |
| Output    | \(o_t = \sigma(W_o[h_{t-1},x_t] + b_o)\)  | What to expose from c_t     |

Cell state update: \(c_t = f_t \odot c_{t-1} + i_t \odot g_t\) — addition creates a gradient highway, solving vanishing gradients.
Hidden state: \(h_t = o_t \odot \tanh(c_t)\)

5. GRU — Simplified LSTM

GRU merges cell+hidden into one state, uses 2 gates (reset + update) instead of 3. Fewer parameters (~3/4 of LSTM), often matches LSTM performance.

\(z_t = \sigma(W_z[h_{t-1},x_t])\) (update gate — interpolates old vs new)
\(r_t = \sigma(W_r[h_{t-1},x_t])\) (reset gate — how much past to use)
\(n_t = \tanh(W_n[r_t \odot h_{t-1}, x_t])\) (candidate state)
\(h_t = (1-z_t) \odot h_{t-1} + z_t \odot n_t\)

6. RNN vs LSTM vs GRU

| Aspect     | Vanilla RNN | LSTM           | GRU              |
|------------|-------------|----------------|------------------|
| States     | h_t only    | h_t + c_t      | h_t only         |
| Gates      | 0           | 3              | 2                |
| Parameters | Fewest      | Most (~4× RNN) | Middle (~3× RNN) |
| Long-range | Poor        | Excellent      | Good             |

7. Gradient Clipping

nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0) — apply AFTER loss.backward(), BEFORE optimizer.step(). Essential for RNNs to prevent exploding gradients.
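
A sketch of the correct ordering inside one training step, with a stand-in LSTM and random data:

```python
# Sketch of gradient clipping: backward, then clip, then step.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 10, 8)                      # (batch, seq_len, features)
target = torch.randn(4, 10, 16)

optimizer.zero_grad()
output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()                                                   # 1. compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # 2. clip AFTER backward
optimizer.step()                                                  # 3. step AFTER clipping
```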

8. PyTorch RNN Differences

| Module  | Returns               | Initial state                |
|---------|------------------------|------------------------------|
| nn.RNN  | (output, h_n)          | h0 only                      |
| nn.LSTM | (output, (h_n, c_n))   | Tuple (h0, c0) ← DIFFERENT!  |
| nn.GRU  | (output, h_n)          | h0 only (same as RNN)        |

9. Text Generation Pipeline

Language model predicts \(P(w_t \mid w_{<t})\). Architecture: Embedding → Dropout → RNN/LSTM/GRU → Linear(vocab_size). Training: input = sequence[0:T], target = sequence[1:T+1] (shifted by 1). Loss = CrossEntropy on last time step logits.

Generation: autoregressive — feed seed, get distribution over next word, sample, append, repeat. Temperature: T<1 = sharper (repetitive), T>1 = flatter (diverse). Top-k: restrict to top-k candidates (k=40 typical).
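
A sketch of temperature + top-k sampling from raw next-token logits (the function name and vocabulary size are illustrative):

```python
# Sketch of temperature + top-k sampling over next-token logits.
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8, k: int = 40) -> int:
    logits = logits / temperature                # T<1 sharpens, T>1 flattens
    topk_vals, topk_idx = torch.topk(logits, k)  # restrict to the k best candidates
    probs = torch.softmax(topk_vals, dim=-1)     # renormalize over top-k only
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

next_token_id = sample_next(torch.randn(5000))   # toy vocabulary of 5000 tokens
```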

10. Perplexity

\(\text{PPL} = \exp(\text{cross_entropy_loss})\). Lower = less surprised = better model. Random = V (vocab size), good English LM = 20–100.

11. Time Series Forecasting

Preprocessing: MinMaxScaler (fit ONLY on train), sliding window construction (SEQ_LEN days → next value). NEVER shuffle time series — use chronological split (train=first 70%, val=next 15%, test=last 15%). Shuffling causes look-ahead bias.
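
A sketch of window construction and the chronological 70/15/15 split on a toy series (in practice `series` would be the scaler output, with the scaler fit on the training portion only):

```python
# Sliding-window construction + chronological split, no shuffling.
import numpy as np

def make_windows(series: np.ndarray, seq_len: int):
    X, y = [], []
    for i in range(len(series) - seq_len):
        X.append(series[i:i + seq_len])          # past seq_len values
        y.append(series[i + seq_len])            # next value to predict
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 20, 500))         # toy series
X, y = make_windows(series, seq_len=30)
n = len(X)
i_val, i_test = int(0.70 * n), int(0.85 * n)     # chronological cut points
X_train, X_val, X_test = X[:i_val], X[i_val:i_test], X[i_test:]
y_train, y_val, y_test = y[:i_val], y[i_val:i_test], y[i_test:]
```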

ACF (Autocorrelation Function): correlation between y_t and y_{t-k}. Use to choose SEQ_LEN — where ACF drops below 0.5.

Module 04: NLP Text Processing & Linguistic Analysis
Cleaning, Tokenization, Stemming/Lemmatization, POS, NER, Parsing

1. The 4-Stage NLP Pipeline

Raw Text → 1. Clean → 2. Tokenize → 3. Remove Stopwords → 4. Normalize

2. Stage 1: Text Cleaning

Pipeline: lowercase → remove HTML (<[^>]+>) → remove URLs → remove punctuation → normalize whitespace (\s+ → single space). Optional: remove numbers.
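
A sketch of this pipeline with Python's re module:

```python
# Cleaning pipeline: lowercase, strip HTML, URLs, punctuation, extra whitespace.
import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)         # remove punctuation
    return re.sub(r"\s+", " ", text).strip()     # normalize whitespace

print(clean_text("Check <b>THIS</b> out: https://example.com!!"))   # "check this out"
```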

3. Stage 2: Tokenization

Sentence tokenization: split paragraph into sentences. Word tokenization: split sentence into words. Penn Treebank style treats punctuation as separate tokens: "it's" → ["it", "'s"].

4. Stage 3: Stop Word Removal

Remove high-frequency, low-information words (the, is, at, by...). When NOT to remove: sentiment analysis ("not good"), authorship attribution (stop words are style markers), neural models (learn importance automatically).

5. Stage 4: Stemming vs Lemmatization

| Method            | Mechanism                   | Output         | Example             | Speed  |
|-------------------|-----------------------------|----------------|---------------------|--------|
| PorterStemmer     | Rule-based suffix stripping | Often non-word | "studies" → "studi" | Fast   |
| WordNetLemmatizer | Dictionary lookup + POS     | Valid word     | "ran" → "run"       | Slower |

Critical: WordNetLemmatizer needs pos='v' for verbs! lemmatize("ran") → "ran" (wrong), lemmatize("ran", pos='v') → "run" (correct).
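
A minimal NLTK sketch of the pitfall (assumes the wordnet corpus has been downloaded with nltk.download('wordnet')):

```python
# Reproduces the POS pitfall: the default POS is noun.
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("ran"))             # 'ran'  (default POS is noun)
print(lem.lemmatize("ran", pos="v"))    # 'run'  (correct with verb POS)
```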

6. Context-Free Grammar (CFG)

Production rules describe sentence structure: S → NP VP, NP → Det N, VP → V NP, PP → P NP. Parse tree for "the cat sat on the mat": S dominates NP("the cat") + VP("sat on the mat").

7. POS Tagging

| Tag  | Category    | Example      | Penn Treebank |
|------|-------------|--------------|---------------|
| NOUN | Noun        | "model"      | NN, NNS, NNP  |
| VERB | Verb        | "trained"    | VB, VBD, VBZ  |
| ADJ  | Adjective   | "deep"       | JJ            |
| ADP  | Preposition | "on", "with" | IN            |

8. NER (Named Entity Recognition)

Labels: PERSON, ORG, GPE (countries/cities), DATE, MONEY, EVENT. spaCy: doc.ents returns entity spans with ent.label_.

9. Dependency Parsing

Reveals grammatical structure as a directed tree. Key labels: nsubj (subject), ROOT (main verb), dobj (direct object), amod (adjective modifier). Extract SVO: find ROOT → nsubj (in lefts) → dobj (in rights).
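
A spaCy sketch of this extraction (assumes the en_core_web_sm model is installed; the sentence is illustrative):

```python
# SVO extraction: find ROOT, then nsubj among its left children, dobj among its right children.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The researcher trained the model.")
for token in doc:
    if token.dep_ == "ROOT":
        subjects = [t.text for t in token.lefts if t.dep_ == "nsubj"]
        objects = [t.text for t in token.rights if t.dep_ == "dobj"]
        print(subjects, token.text, objects)    # ['researcher'] trained ['model']
```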

10. spaCy vs NLTK

| Feature         | spaCy                        | NLTK                     |
|-----------------|------------------------------|--------------------------|
| Speed           | Fast (C-optimized)           | Slower (Python)          |
| POS/NER/Parsing | Built-in, production quality | Educational, needs setup |
| Best for        | Production NLP               | Learning/research        |

Module 05: Classical Text Representation
One-Hot, BoW, N-grams, TF-IDF, Sparse vs Dense

1. The Representation Problem

ML models need numerical input. Text is symbolic. The representation hierarchy:

One-Hot → BoW → N-grams → TF-IDF → Word2Vec → BERT

2. One-Hot Encoding

Vector of size |V| with 1 at word's index. Problems: dimensionality = 50K+, 99.99% sparse, no semantic similarity — cosine("cat","dog") = cosine("cat","table") = 0.

3. Bag-of-Words (BoW)

Document = vector of word counts (order discarded). Uses CountVectorizer(max_features, min_df, max_df, ngram_range). Typically >99% sparse. Limitation: "The dog bit the man" = "The man bit the dog" (word order lost).

4. N-grams

Contiguous sequences of N words. Bigram: "not good" as one feature → captures negation. Tradeoff: bigrams grow vocab ~10×, trigrams ~100×. Performance peaks at bigrams.

5. TF-IDF — Weighted Importance

\(\text{TF}(t,d) = \frac{\text{count}(t,d)}{|d|}\) — how frequent in this doc.
\(\text{IDF}(t) = \log\frac{N}{1 + df(t)} + 1\) — how rare across all docs.
\(\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)\)

| Word        | TF   | IDF | TF-IDF | Interpretation                 |
|-------------|------|-----|--------|--------------------------------|
| "the"       | 0.15 | 0.1 | 0.015  | Very low — appears everywhere  |
| "film"      | 0.05 | 2.3 | 0.115  | Medium — domain specific       |
| "brilliant" | 0.02 | 4.5 | 0.090  | High — rare, discriminative    |

sublinear_tf=True: replaces TF with log(1+TF) — 100× frequency ≠ 100× importance.
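
A sketch of a TF-IDF + logistic regression classifier using these parameters, on toy data:

```python
# TF-IDF (unigrams + bigrams, sublinear TF) feeding a logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a brilliant film", "an awful boring movie", "superb acting", "terrible plot"]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, max_features=50_000),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["a brilliant plot"]))
```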

6. IMDb Benchmarks (Classical Methods)

  • BoW accuracy: 86.18%
  • N-grams (1–3): 90.11%
  • TF-IDF (1+2g): 90.13%
  • TF-IDF AUC-ROC: 0.965

7. Why TF-IDF > BoW

  • Common words (stopwords) are down-weighted by low IDF
  • Rare, discriminative words are amplified by high IDF
  • Document length normalized (TF is relative, not absolute count)

8. Why N-grams > BoW

  • "not good" as a single feature captures negation (BoW can't)
  • Phrasal patterns captured: specific sentiment bigrams

9. Feature Inspection

For logistic regression: clf.coef_[0] gives weight per feature. High positive → strongly positive sentiment ("brilliant", "excellent"). High negative → strongly negative ("terrible", "awful").

Fundamental limitation of ALL classical methods
NO semantic understanding. "brilliant" and "superb" are treated as unrelated features. No polysemy: "bank" (financial) = "bank" (river) — same vector. Context-independent. Only word embeddings and BERT solve this.

Module 06: Word Embeddings
Word2Vec, GloVe, FastText, Distributional Hypothesis

1. Why Embeddings?

Classical methods are sparse, high-dimensional, semantically blind. Word embeddings produce dense, low-dimensional, semantically meaningful vectors. Famous property: \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\).

2. The Distributional Hypothesis

"Words that appear in similar contexts have similar meanings" (Harris, 1954). If "brilliant" and "superb" appear in the same contexts, their vectors should be close.

3. Word2Vec (Mikolov, 2013)

Trains a shallow neural network on context prediction. Two modes:
CBOW: context → target word (faster).
Skip-gram: target → context (better for rare words).

| Parameter   | Meaning                             | Typical |
|-------------|-------------------------------------|---------|
| vector_size | Embedding dimension                 | 100–300 |
| window      | Context words on each side          | 3–5     |
| sg          | 1 = Skip-gram, 0 = CBOW             | 1       |
| min_count   | Ignore words below this frequency   | 1–5     |

Document vector = mean pooling: average of all word vectors in the document. Problem: all words get equal weight — "not" = "film" = "the". This is why TF-IDF often beats Word2Vec on classification.
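
A gensim sketch (4.x API) of training Word2Vec and mean-pooling a document vector on toy sentences:

```python
# Word2Vec training + mean-pooled document vector; sentences are toy data.
import numpy as np
from gensim.models import Word2Vec

sentences = [["a", "brilliant", "film"], ["an", "awful", "boring", "movie"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

def doc_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]   # skip OOV words
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

print(doc_vector(["brilliant", "film"], w2v).shape)   # (100,)
```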

4. GloVe (Stanford, 2014)

Builds a global co-occurrence matrix X[i,j] = how often word j appears near word i (weighted by 1/distance). Then factorizes: \(X \approx U V^T\) via SVD of log(X). Better at capturing global statistics; Word2Vec better at local patterns.

5. FastText (Facebook, 2016)

Key innovation: subword embeddings. Decomposes word into character n-grams (min_n=3, max_n=6):

"acting" → ["<ac", "act", "cti", "tin", "ing", "ng>", "<acting>"]

\(\vec{\text{word}} = \sum \vec{\text{subword}}\)

OOV solved: even unseen words get a vector via shared subwords. Best for noisy text, morphologically rich languages, rare domain terms.

6. Method Comparison

| Method   | Semantic | Context              | OOV              | Training                     |
|----------|----------|----------------------|------------------|------------------------------|
| Word2Vec | ✅       | ❌ (one vector/word) | ❌ (zero vector) | Neural, local windows        |
| GloVe    | ✅       | ❌                   | ❌               | Matrix factorization, global |
| FastText | ✅       | ❌                   | ✅ (subwords)    | Neural, local + subword      |
| BERT     | ✅       | ✅ (contextual)      | ✅ (WordPiece)   | Transformer, bidirectional   |

7. IMDb Benchmarks (Embedding Methods)

  • Word2Vec (100d): 85.65%
  • GloVe-SVD (5k): 76.80%
  • FastText (100d): 85.95%

Why TF-IDF (90.13%) > Word2Vec on IMDb? Mean pooling gives equal weight to all words — "the" and "brilliant" contribute equally. TF-IDF naturally weights by importance. Key sentiment words get diluted by mean pooling.

8. Cosine Similarity

\(\cos(a,b) = \frac{a \cdot b}{\|a\| \times \|b\|}\). Ignores magnitude, focuses on direction. Near-synonyms > 0.8, related words 0.5–0.8, unrelated < 0.3. Better than Euclidean for word vectors (length ≠ strength of meaning).

9. The Decision Tree

Choosing a representation
Rare words/typos/informal text? → FastText
Semantic similarity important? → Word2Vec / GloVe
Very small corpus (<1k docs)? → TF-IDF (embeddings need data)
Always: start with TF-IDF as baseline. Justify complexity with measurable gains.

Module 07: Transformers & Large Language Models
Attention, Multi-Head, BERT, GPT, LoRA, RAG, Prompting

1. Why Transformers? (vs RNNs)

RNNs have two critical limits: (1) sequential bottleneck — step t depends on step t−1, cannot parallelize; (2) vanishing gradients — even LSTM struggles beyond ~100 tokens. Transformers solve both: all positions processed in parallel, self-attention directly connects any two positions in O(1).

2. Scaled Dot-Product Attention

\(\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V\)

Q (Query): "what am I looking for?" K (Key): "what do I contain?" V (Value): "what do I contribute?"
The \(\sqrt{d_k}\) scaling prevents dot products from growing large and pushing softmax into saturated (near-zero gradient) regions.
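
A minimal single-head sketch of the formula; the optional boolean mask is the causal mask described in Section 5:

```python
# Scaled dot-product attention; positions where mask=True are blocked.
import math
import torch

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # (batch, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))    # e.g. causal mask (Section 5)
    weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ V                                      # weighted sum of values

Q = K = V = torch.randn(2, 10, 64)                          # (batch, seq_len, d_k)
causal = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)  # block j > i
print(attention(Q, K, V, causal).shape)                     # torch.Size([2, 10, 64])
```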

3. Multi-Head Attention

\(\text{MHA}(Q,K,V) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O\) where each head = Attention with different learned projections. Each head can focus on different aspects simultaneously: syntax, coreference, semantics.

4. Transformer Encoder Block

Two sub-layers, both with residual + LayerNorm:

Multi-Head Self-Attn → Add & Norm → FFN → Add & Norm

\(\text{output} = \text{LayerNorm}(x + \text{sublayer}(x))\). FFN = two linear layers with ReLU: \(\max(0, xW_1+b_1)W_2+b_2\). Hidden dim typically 4× model dim — most parameters live here.

5. Positional Encoding & Causal Masking

Positional encoding: attention is permutation-invariant (tokens as a set). Add sinusoidal position vectors: \(PE(pos,2i)=\sin(pos/10000^{2i/d})\).
Causal masking: in decoder self-attention, set positions j > i to −∞ before softmax → token i can only see tokens ≤ i. Used in GPT (autoregressive generation).

6. Decoding Strategies

| Strategy        | Mechanism               | Pros                   | Cons                  |
|-----------------|-------------------------|------------------------|-----------------------|
| Greedy          | Pick max prob token     | Fast, deterministic    | Often suboptimal      |
| Beam search     | Keep top-k partial seqs | Higher quality         | Slower, less diverse  |
| Temperature T<1 | Sharpen distribution    | More coherent          | Repetitive            |
| Temperature T>1 | Flatten distribution    | More creative/diverse  | Less coherent         |

7. BERT vs GPT (Architecture)

| Aspect       | BERT                             | GPT                          |
|--------------|----------------------------------|------------------------------|
| Architecture | Encoder-only                     | Decoder-only                 |
| Training     | Masked LM (bidirectional)        | Autoregressive (next token)  |
| Context      | Sees full sequence (left+right)  | Sees only past tokens        |
| Best for     | Classification, extraction       | Generation                   |
| Parameters   | 110M (base)                      | 175B+ (GPT-3)                |

8. BERT Details

BERT-base: 12 encoder layers, 12 heads, d=768, ~110M params. Max 512 tokens.
Pre-training: Masked LM (predict 15% masked tokens) + Next Sentence Prediction (sentence B follows A?).
Tokenization: WordPiece — "unbelievable" → ["un","##believ","##able"]. Special tokens: [CLS], [SEP], [PAD], [MASK].
Fine-tuning: small LR (2e-5 to 5e-5), warmup steps (500), 3 epochs usually enough. Frozen BERT (feature extraction) often worse than TF-IDF — BERT is designed to be fine-tuned.
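
A sketch of one fine-tuning step with HuggingFace Transformers on a toy batch (in practice this sits inside a full training loop or the Trainer API):

```python
# One fine-tuning step for BERT sequence classification; texts and labels are toy data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["a brilliant film", "an awful movie"], padding=True,
                  truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # small LR: weights are fragile
outputs = model(**batch, labels=labels)                      # returns loss and logits
outputs.loss.backward()
optimizer.step()
```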

9. LLM Training Stages

1. Pre-training → 2. SFT → 3. RLHF

Pre-training: next-token prediction on trillions of tokens — broad world knowledge.
SFT: Supervised Fine-Tuning on curated instruction–response pairs — teaches following instructions.
RLHF: train reward model on human preferences, then fine-tune LLM with PPO + KL penalty against SFT model — aligns with human values.

10. LoRA (Low-Rank Adaptation)

\(W' = W + \frac{\alpha}{r}AB\) where \(A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times k}\) with \(r \ll \min(d,k)\). Only \(r(d+k)\) trainable params vs \(dk\) for full fine-tuning. For 4096×4096 with r=8: 65K vs 16.8M — a 256× reduction.
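
A conceptual sketch of a LoRA-augmented linear layer (not the peft library API), using the 4096×4096, r=8 numbers above:

```python
# LoRA sketch: frozen base weight W plus trainable low-rank update (alpha/r) * A @ B.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, k: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d, k, bias=False)
        self.base.weight.requires_grad = False           # frozen pre-trained W
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # low-rank factor A (d×r)
        self.B = nn.Parameter(torch.zeros(r, k))         # B starts at zero, so W' = W at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)

layer = LoRALinear(4096, 4096, r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # 65536 = r(d+k)
```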

11. Prompting Strategies

  • Zero-shot: task description only, no examples. Works on large well-trained models.
  • Few-shot: include k demonstration pairs before query. Guides format + reasoning, no weight updates.
  • Chain-of-Thought (CoT): include step-by-step reasoning in demonstrations (or append "Let's think step by step"). Dramatically improves multi-step reasoning.

12. RAG (Retrieval-Augmented Generation)

Grounds LLM in external documents to reduce hallucination. 5 steps:

1. Index → 2. Retrieve → 3. Augment → 4. Generate → 5. Cite

Chunk docs → embed into vector DB → embed query → find top-k similar chunks → prepend to prompt → LLM generates grounded answer → return sources.
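
A toy sketch of steps 1–3, using TF-IDF vectors as a stand-in for a real embedding model and vector database:

```python
# RAG sketch: index chunks, retrieve top-k by cosine similarity, build an augmented prompt.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Transformers process all positions in parallel with self-attention.",
    "LSTMs use gates to control a cell state.",
    "TF-IDF down-weights words that appear in every document.",
]
vectorizer = TfidfVectorizer().fit(chunks)
index = vectorizer.transform(chunks)                                  # 1. Index the chunks

query = "How does self-attention work?"
scores = cosine_similarity(vectorizer.transform([query]), index)[0]   # 2. Retrieve
top_k = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

prompt = ("Answer using only this context:\n" + "\n".join(top_k)
          + f"\n\nQuestion: {query}")                                 # 3. Augment
# 4. Generate: send `prompt` to any LLM    5. Cite: return the retrieved chunks
```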

13. Hallucination

LLMs generate plausible-sounding but factually incorrect content because they're trained to predict fluent continuations, not verify facts. Mitigations: RAG (ground in documents), RLHF honesty training.

14. The Full IMDb Hierarchy

  • BoW: 86.18%
  • N-grams: 90.11%
  • TF-IDF: 90.13%
  • Word2Vec: 85.65%
  • FastText: 85.95%
  • BERT fine-tuned: 93.90% (370 ms inference)

15. The Representation Hierarchy

BoW → TF-IDF (+weighting) → W2V (+semantic) → FastText (+OOV) → BERT (+context)

Each level fixes one limitation of the previous. BERT is the only method that has contextual representations — the same word gets different vectors depending on its surrounding words, solving polysemy.

16. Tools & Ecosystem

  • HuggingFace Transformers: unified API for thousands of pre-trained models. AutoTokenizer, AutoModel, pipeline(), Trainer.
  • Ollama: run quantized open-source LLMs locally (4-bit quantization). ollama run llama3. Privacy-preserving, no cloud costs.
  • LangChain: compose LLM pipelines with standard invoke(input) → output interface. Chains prompt templates, LLMs, retrievers, tools.
  • Vector DBs (FAISS, Pinecone, Chroma): fast approximate nearest-neighbor search for dense embeddings. Required for RAG retrieval.

17. Model Selection — 5 Questions

How to choose the right model
1. Task type — classification, generation, or retrieval?
2. Latency/memory constraints? (TF-IDF = 1ms; BERT = 370ms CPU)
3. Is labeled fine-tuning data available?
4. Privacy / data residency requirements? (on-prem? Ollama?)
5. Compute budget for training + inference?

18. The Map That Does Not Expire

Understand why architectures work — the math behind attention, gradient flow, representation learning — not today's model names. Models change yearly; the underlying principles (information bottlenecks, optimization landscapes, inductive biases) are permanent.

ENSAM Casablanca · Deep Learning & NLP · May 2026