ENSAM Casablanca · Deep Learning & NLP 2025/2026

Quick Q&A — All Modules

210 questions with direct answers. Each answer is 1–3 sentences — full meaning, no padding.

Total Questions: 210
Modules: 01 → 07
Answer Style: Direct · 1–3 sentences
Exam Date: 18 May 2026
01 · ANN 02 · CNN 03 · RNN/LSTM 04 · NLP Prep 05 · Classical 06 · Embeddings 07 · Transformers
MODULE 01 ANN & Deep Learning Fundamentals 30 Q
01What is the Universal Approximation Theorem and what does it guarantee about neural networks?

A feedforward network with at least one hidden layer and enough neurons can approximate any continuous function to arbitrary accuracy. It guarantees existence, not how to find the weights or how many neurons are needed.

02What is the vanishing gradient problem and what causes it?

During backprop, the sigmoid derivative is ≤ 0.25 (tanh derivative ≤ 1); multiplied across N layers, the gradient approaches zero and early layers stop learning. Fix: ReLU, residual connections (ResNet), batch normalization, LSTM.

03Why is ReLU preferred over sigmoid/tanh in hidden layers?

ReLU gradient = 1 for x > 0 (no saturation, no vanishing gradient), computation is trivial (max(0,x)), and ~50% of neurons are sparse (zero), which is efficient. Sigmoid/tanh saturate and shrink gradients to near zero.

04What is the dying ReLU problem and how does Leaky ReLU fix it?

If a neuron's input is always negative, ReLU always outputs 0 with gradient 0 — the neuron never updates and is permanently dead. Leaky ReLU: $\max(\alpha x, x)$ with $\alpha = 0.01$ keeps a small gradient alive for negative inputs.

05Write the gradient descent weight update formula and explain each term.
$w \leftarrow w - \eta \cdot \dfrac{\partial L}{\partial w}$

$\eta$ = learning rate (step size); $\partial L/\partial w$ = gradient (direction of steepest ascent); subtracting the gradient moves $w$ toward lower loss.
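
A minimal NumPy sketch of this update rule on a toy quadratic loss (the loss, learning rate, and step count below are illustrative assumptions, not course values):

import numpy as np

eta = 0.1                     # learning rate
w = 0.0
for _ in range(50):
    grad = 2 * (w - 3)        # dL/dw for the toy loss L(w) = (w - 3)^2
    w = w - eta * grad        # move against the gradient, toward lower loss
print(w)                      # approaches the minimum at w = 3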

06What is the difference between L1 and L2 regularization?

L1 adds $\lambda\sum|w_i|$ → produces sparse weights (some exactly zero, feature selection). L2 adds $\lambda\sum w_i^2$ → shrinks all weights toward zero but none become exactly zero (weight decay). L2 is the default for deep learning.

07What is dropout and during which phases (train vs inference) is it active?

Dropout randomly sets fraction $p$ of neuron outputs to zero during each training forward pass, forcing the network to learn redundant representations. During inference it is OFF — all neurons are used, outputs scaled by $(1-p)$.

08What is Batch Normalization and what problem does it solve?

BN normalizes each layer's activations to mean 0, std 1 across the mini-batch, then applies learnable $\gamma, \beta$: $\hat{x} = (x-\mu_B)/\sqrt{\sigma_B^2+\epsilon}$, $y=\gamma\hat{x}+\beta$. Solves internal covariate shift — enables faster training, higher learning rates, and reduces sensitivity to initialization.

09What is the difference between Batch GD, Stochastic GD, and Mini-batch GD?

Batch GD: all N samples per update — exact gradient, very slow. SGD: 1 sample — noisy, fast, can escape local minima. Mini-batch (32–256): balanced stability and GPU efficiency — the standard in practice.

10What is Xavier (Glorot) initialization and when should it be used vs He initialization?

Xavier: $W \sim \mathcal{U}[-\sqrt{6}/\sqrt{n_{in}+n_{out}},\, \sqrt{6}/\sqrt{n_{in}+n_{out}}]$ — designed for sigmoid/tanh. He: $W \sim \mathcal{N}(0, \sqrt{2/n_{in}})$ — designed for ReLU. Wrong initialization → vanishing or exploding activations from the first forward pass.

11Write the gradient checking approximation formula and what tolerance is expected.
$\dfrac{\partial L}{\partial w} \approx \dfrac{f(w+\varepsilon) - f(w-\varepsilon)}{2\varepsilon}, \quad \varepsilon \approx 10^{-5}$

Expected relative error < $10^{-7}$ → backprop is correct. Error > $10^{-5}$ → bug in backpropagation implementation.
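
A minimal NumPy sketch of the check on a toy loss (the loss and shapes are assumptions; the analytic gradient here stands in for your backprop output):

import numpy as np

def loss(w):
    return np.sum(w ** 2)                  # toy loss; analytic gradient is 2w

def numerical_grad(f, w, eps=1e-5):
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)   # central difference
    return grad

w = np.random.randn(5)
analytic = 2 * w                           # pretend this came from backprop
numeric = numerical_grad(loss, w)
rel_err = np.abs(analytic - numeric) / (np.abs(analytic) + np.abs(numeric) + 1e-12)
print(rel_err.max())                       # should be far below 1e-7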

12What does a training curve where train_loss ↓ but val_loss ↑ indicate, and what should you do?

This is overfitting — the model memorizes training data instead of generalizing. Fix: add Dropout, L2 regularization, data augmentation, EarlyStopping, or reduce model capacity.

13Why use softmax for multi-class output and sigmoid for binary? What is the mathematical relationship?

Sigmoid $\sigma(x)=1/(1+e^{-x}) \in (0,1)$ gives one probability for binary tasks. Softmax $e^{z_i}/\sum_j e^{z_j}$ outputs K probabilities summing to 1 for multi-class. Softmax with 2 classes is mathematically equivalent to sigmoid.

14What role does the chain rule play in backpropagation?

The chain rule decomposes $\partial L/\partial w$ into a product of local gradients: $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w}$. This lets each layer compute its gradient locally and propagate it backward without knowing the full network structure.

15What is the Adam optimizer and what two techniques does it combine?

Adam combines Momentum (1st moment $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$) and RMSprop (2nd moment $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$); update: $w \leftarrow w - \eta\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)$. Defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\eta=10^{-3}$.
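
A minimal NumPy sketch of the Adam update using the default hyperparameters above (the toy gradient is an assumption for illustration):

import numpy as np

beta1, beta2, eta, eps = 0.9, 0.999, 1e-3, 1e-8
w = np.zeros(3)
m = np.zeros_like(w)                         # 1st moment (momentum)
v = np.zeros_like(w)                         # 2nd moment (RMSprop)
for t in range(1, 101):
    g = 2 * (w - 1.0)                        # toy gradient of sum((w - 1)^2)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)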

16What is cross-entropy loss and why is it preferred over MSE for classification?

$\mathcal{L}_{CE} = -\sum_i y_i \log\hat{y}_i$. With sigmoid, the cross-entropy gradient is $(\hat{y}-y)$ — large when wrong, small when correct. MSE + sigmoid saturates at confident-wrong predictions, nearly zeroing the gradient and killing learning.

17What is overfitting, underfitting, and how do you detect each from training curves?

Overfitting: low train loss, high & diverging val loss — model memorizes. Underfitting: high train and val loss roughly equal — model too simple. Good fit: both losses low and close together.

18Why can't all weights be initialized to zero? What is the problem?

All-zero weights → every neuron computes identical output and receives identical gradient → all neurons update identically forever (symmetry problem). A layer of N neurons behaves like 1 neuron — all capacity is wasted. Weights must be randomly initialized to break symmetry.

19What is the purpose of the bias term $b$ in a neuron?

The bias $b$ in $z = \mathbf{w}^T\mathbf{x} + b$ shifts the activation function horizontally, allowing a neuron to fire even when all inputs are zero. Without bias, the decision boundary is forced through the origin, severely limiting what the model can represent.

20What is EarlyStopping and what are the key parameters: patience and restore_best_weights?

EarlyStopping monitors val_loss and halts training when it fails to improve for patience epochs. restore_best_weights=True rolls back to the weights from the epoch with the lowest val_loss, not the final epoch.

21What is the learning rate and what happens if it is too large or too small?

Learning rate $\eta$ controls the weight update step size. Too large → loss oscillates or diverges (overshoots minimum). Too small → training is extremely slow and may get stuck. Adam's default $\eta=10^{-3}$ is a good starting point.

22What is gradient clipping and when is it used?

Gradient clipping caps the gradient norm before the weight update: if $\|\nabla\| > \text{threshold}$, then $\nabla \leftarrow (\text{threshold}/\|\nabla\|)\cdot\nabla$. Used primarily for RNNs/LSTMs where gradients can explode over long sequences. Course lab: GRAD_CLIP = 5.0.
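
A minimal sketch of clipping by global norm with the lab's threshold; in Keras, optimizers accept a clipnorm argument (e.g. Adam(clipnorm=5.0)) for the same purpose:

import numpy as np

GRAD_CLIP = 5.0

def clip_by_norm(grad, threshold=GRAD_CLIP):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)     # rescale so the norm equals the threshold
    return grad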

23What is the difference between the validation set and the test set?

Validation: used for hyperparameter tuning and EarlyStopping — influences model selection but not weight updates. Test: touched once at the very end for an unbiased final performance estimate. Using test data to tune hyperparameters is data leakage.

24What activation function is used in the output layer for regression vs binary vs multi-class classification?

Regression: none (linear), MSE loss. Binary: sigmoid, binary cross-entropy. Multi-class (exclusive): softmax, categorical cross-entropy. Multi-label (independent): sigmoid per output, binary cross-entropy per label.

25What is weight decay and how does it relate to L2 regularization?

Weight decay and L2 regularization are mathematically equivalent for standard SGD. The L2 gradient term yields update $w \leftarrow (1-\eta\lambda)w - \eta\partial L/\partial w$ — the factor $(1-\eta\lambda)$ decays the weight at every step, hence "weight decay."

26What is a confusion matrix and what are the four values TP, TN, FP, FN?

TP = correct positive. TN = correct negative. FP = false alarm (predicted positive, actually negative). FN = missed positive (predicted negative, actually positive). Accuracy=(TP+TN)/all; Precision=TP/(TP+FP); Recall=TP/(TP+FN); F1=2PR/(P+R).
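
A short scikit-learn illustration of how the four counts map to the metrics (the labels below are hypothetical):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # binary case: [[TN, FP], [FN, TP]]
print(tp, tn, fp, fn)                                       # 3 3 1 1
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))   # 0.75 0.75 0.75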

27What is the ReduceLROnPlateau callback and what does it do when triggered?

When val_loss doesn't improve for patience=3 epochs, ReduceLROnPlateau multiplies the learning rate by factor=0.5. This allows the optimizer to take finer steps around a plateau. min_lr=1e-6 prevents the LR from going arbitrarily small.

28What is the relationship between batch size and training stability/generalization?

Small batches (8–32): high gradient noise → acts as regularizer, better generalization, slower GPU. Large batches (256+): smooth gradient → faster GPU, sharper minima, worse generalization. Rule: scale LR linearly with batch size. Standard choice: 32–128.

29What is the exploding gradient problem and how is it different from the vanishing gradient?

Vanishing: gradients → 0 from multiplying small numbers (<1) — early layers stop learning. Exploding: gradients → ∞ from multiplying large numbers (>1) — weights blow up (NaN). Fix vanishing: ReLU/ResNet; fix exploding: gradient clipping.

30What is momentum in optimization and what problem does it solve over plain gradient descent?

Momentum accumulates a moving average of gradients: $v_t = \beta v_{t-1} + (1-\beta)g_t$, $\beta=0.9$; update: $w \leftarrow w - \eta v_t$. Smooths zig-zag oscillations in narrow ravines, helps escape shallow local minima, and maintains progress in flat loss regions.

MODULE 02 Convolutional Neural Networks (CNN) 30 Q
01Why can't a plain Fully-Connected (Dense) network efficiently process images?

A 224×224 RGB image has 150,528 inputs — one Dense layer of 1,000 neurons ≈ 150 million parameters, impractical to train. FCNs also ignore spatial structure: adjacent pixels are treated as unrelated, while CNNs exploit spatial locality via weight sharing.

02What does a convolutional filter (kernel) compute, conceptually?

A filter slides over the input computing an element-wise dot product at each position: $(\mathbf{f}*\mathbf{x})[i,j] = \sum_{m,n} f[m,n]\cdot x[i+m,j+n]$. Each filter learns to detect a specific pattern (edge, texture) and produces a feature map showing where that pattern occurs.

03Write the formula for the output size of a convolutional layer.
$W_{out} = \lfloor(W_{in} - K + 2P)/S\rfloor + 1$

Example: 224×224, K=3, P=0, S=2 → $\lfloor(224-3)/2\rfloor+1=111$. With P=1 (same), S=1: output = input size.
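
A small Python helper implementing the formula (the values reproduce the example above):

def conv_out(w_in, k, p=0, s=1):
    return (w_in - k + 2 * p) // s + 1    # floor division matches the formula

print(conv_out(224, k=3, p=0, s=2))   # 111
print(conv_out(224, k=3, p=1, s=1))   # 224 ("same" padding keeps the size)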

04What is "same" padding vs "valid" padding?

"same": adds $P=\lfloor K/2\rfloor$ zeros so output size = input size (stride=1) — use to preserve spatial dimensions. "valid": $P=0$, output shrinks by $K-1$ — use to intentionally reduce spatial size.

05How many trainable parameters does Conv2D(32 filters, 3×3 kernel) have on an RGB input (3 channels)?
$(K \times K \times C_{in} + 1) \times C_{out} = (3\times3\times3+1)\times32 = 28\times32 = \mathbf{896}$

The same 896 parameters are reused at every spatial position — this weight sharing is why CNNs are far more parameter-efficient than FCNs.

06What is "weight sharing" in CNNs and why is it important?

The same filter weights are reused at every spatial position of the input. This gives (1) parameter efficiency (896 params process an entire 224×224 image), and (2) translation invariance — a feature detector fires wherever that pattern appears.

07What is max pooling and what are its three benefits?

MaxPool takes the maximum in each 2×2 window (stride 2), halving spatial dims. Three benefits: (1) dimensionality reduction, (2) spatial invariance to small translations, (3) fewer parameters in subsequent layers. It has zero learnable parameters.

08What is Global Average Pooling (GAP) and why is it better than Flatten + Dense?

GAP reduces each $(H,W,C)$ feature map to shape $(C,)$ by taking the spatial average per channel — zero parameters. Flatten+Dense creates $H\times W\times C\times\text{units}$ parameters (88% of VGG's params). Modern networks (ResNet, MobileNet) use GAP to eliminate this explosion.

09Why are two stacked 3×3 conv layers preferred over one 5×5 layer? Give the parameter count.

Same effective receptive field, but two 3×3 = $2\times(9C)=18C$ params vs one 5×5 = $25C$ — a 28% saving. Three 3×3 (=7×7 receptive field) vs one 7×7: $27C$ vs $49C$ — 45% saving. Plus additional non-linearity between the stacked layers.

10What is the filter doubling rule and why does it keep compute constant?

After each MaxPool (spatial ÷2), double the filters: 32→64→128→256→512. Halving the spatial dimensions cuts a conv layer's work by 4 (a quarter of the positions), while doubling both input and output channels multiplies it by 4 (cost scales with $C_{in}\times C_{out}$), so per-layer compute stays roughly constant and deeper layers gain more semantic capacity without extra cost.

11What is the power-of-2 compression cascade for a 224×224 input? How many conv blocks does it imply?

224→112→56→28→14→7 (5 halvings). Stop at 7×7 — going further destroys spatial structure needed for classification. This gives 5 conv blocks, exactly the depth of VGGNet.

12What are the key innovations introduced by AlexNet (2012) that enabled modern deep learning?

AlexNet (15.3% top-5 error, won ImageNet 2012) introduced: (1) ReLU (6× faster than tanh), (2) Dropout (p=0.5) as regularization, (3) GPU training (2×GTX 580), (4) data augmentation, (5) Local Response Normalization.

13What is the residual (skip) connection in ResNet and what problem does it solve?

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$. The identity shortcut lets gradients flow directly backward without passing through activations, solving both the vanishing gradient problem and the degradation problem (adding more layers made networks worse without skip connections). Enables 100+ layer training.

14What is the difference between Feature Extraction and Fine-Tuning in transfer learning?

Feature extraction: freeze pretrained layers, train only new head — fast, needs less data, use when target ≈ ImageNet. Fine-tuning: unfreeze last N layers with very small LR ($10^{-5}$) — adapts features, needs more data, use when target domain differs significantly.

15What is MobileNet's key innovation and why is it suited for mobile devices?

MobileNet uses depthwise separable convolutions: (1) depthwise — one filter per channel (spatial filtering), (2) pointwise 1×1 — mix channels. This uses ~8–9× fewer operations than standard convolutions, enabling real-time inference on mobile CPUs.

16In a VGG-like CNN, what percentage of parameters sit in Dense layers vs Conv layers?

VGG-16: Conv ≈ 14.7M (~12%), Dense ≈ 123.6M (~88%), with the first FC layer (7×7×512 → 4096) alone accounting for ≈ 103M. This is why modern architectures replace Flatten+Dense with Global Average Pooling — it eliminates the 88% parameter explosion while matching or improving accuracy.

17What are the three essential training callbacks? Give the purpose and key config for each.

EarlyStopping(patience=5, restore_best_weights=True): stop when val_loss stalls, rollback to best. ModelCheckpoint(save_best_only=True): save best weights to disk. ReduceLROnPlateau(factor=0.5, patience=3, min_lr=1e-6): halve LR when stuck.
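
A Keras sketch of the three callbacks with the configs listed above (the checkpoint filename is a hypothetical choice):

from tensorflow.keras import callbacks

cbs = [
    callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    callbacks.ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True),
    callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6),
]
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, callbacks=cbs)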

18When is Recall more important than Precision? Give a concrete medical AI example.

Recall is more important when False Negatives are costlier than False Positives. Tumor detection: a missed cancer (FN) can be fatal; a false alarm (FP) means an unnecessary biopsy — recoverable. We want high Recall even at the cost of lower Precision.

19What is data augmentation and name four transforms used in image classification?

Data augmentation generates new training samples via label-preserving transforms, reducing overfitting. Common transforms: random horizontal flip, rotation (±15°), brightness/contrast adjustment, random crop/zoom. Never apply transforms that change the label (e.g., don't flip digits 6/9).

20What is Batch Normalization in CNNs — where is it applied and what does it do to feature maps?

In CNNs, BN is applied after conv layers (before or after ReLU), normalizing each channel's activations across the mini-batch to mean 0, std 1. Prevents any single channel from dominating, stabilizes activations across layers, and enables larger learning rates.

21What is stride and how does stride=2 differ from stride=1 with max pooling for downsampling?

Stride = step size between filter positions. Stride=2 halves spatial size by skipping positions. Conv(stride=2) has learnable params and learns what to keep; MaxPool(2×2) always picks the maximum with no learnable params. Modern networks prefer strided conv over pooling.

22What is the Inception module (GoogLeNet) and what problem does it solve?

Inception applies 1×1, 3×3, 5×5 filters and MaxPool in parallel, then concatenates all outputs along the channel dimension. Solves the problem of choosing the right filter size — the network learns which scale is most useful. 1×1 convs before larger filters reduce computational cost.

23What does `include_top=False` mean when loading a pretrained model like VGG16?

Loads only the convolutional backbone, excluding the 1000-class Dense classifier layers. You then add your own classification head for your task — reusing the ImageNet-learned features while training only the new output layers.
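
A Keras sketch of feature extraction with a frozen VGG16 backbone (the head sizes and NUM_CLASSES are assumptions):

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

NUM_CLASSES = 5                               # hypothetical number of target classes

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                        # freeze the pretrained conv backbone

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation='softmax'),   # new task-specific head
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])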

24What is EfficientNet's compound scaling strategy?

EfficientNet scales depth ($d=\alpha^\phi$), width ($w=\beta^\phi$), and resolution ($r=\gamma^\phi$) simultaneously using compound coefficient $\phi$, subject to $\alpha\cdot\beta^2\cdot\gamma^2\approx2$ (constant FLOP budget). Result: best accuracy-to-parameter ratio of any architecture at the time.

25What is the F1 Score, when do you use it, and why is it better than accuracy for imbalanced datasets?

$F_1 = 2PR/(P+R)$. Use for imbalanced datasets: a classifier that always predicts the majority class gets 99% accuracy but F1≈0 because Recall for the minority class = 0. F1 forces the model to actually detect both classes.

26What is a 1×1 convolution and what is it used for?

A 1×1 conv applies a linear combination across channels at each spatial position, changing channel count without spatial filtering: $(H,W,C_{in}) \to (H,W,C_{out})$. Used for (1) dimensionality reduction / bottleneck (Inception, ResNet), (2) increasing channels, (3) nonlinear cross-channel mixing.

27What is AUC-ROC and what does an AUC of 0.5 vs 1.0 mean?

ROC plots TPR (Recall) vs FPR across all classification thresholds; AUC = area under this curve. AUC=1.0: perfect classifier. AUC=0.5: random guessing (no discriminative ability). AUC is threshold-independent and handles class imbalance well.

28What is the standard CNN pipeline for image classification (the full forward pass)?
Input(H×W×3) → /255 normalize → [Conv2D(ReLU) → BatchNorm → MaxPool] × N → GlobalAvgPool → Dense(ReLU) → Dropout → Dense(Softmax/Sigmoid) → class probabilities
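
A minimal Keras sketch of that pipeline (filter counts, input size, and the 10-class head are assumptions):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Rescaling(1. / 255, input_shape=(224, 224, 3)),   # /255 normalize
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),                   # hypothetical 10 classes
])
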
29What does VGG stand for and what are the two main VGG variants?

VGG = Visual Geometry Group (Oxford, 2014). VGG-16: 13 conv + 3 Dense, 138M params, 7.3% top-5 error. VGG-19: 16 conv + 3 Dense, 144M params. Both use exclusively 3×3 filters — demonstrating that depth beats large filters.

30What is the best practice for normalizing image inputs, and what problems does NOT normalizing cause?

Divide by 255 to scale to [0,1], or use ImageNet stats (mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]) for pretrained models. Without normalization: large activations → gradient instability, poor optimizer conditioning, and mismatch with He/Xavier initialization assumptions.

MODULE 03 Recurrent Networks — RNN · LSTM · GRU 30 Q
01What is the fundamental architectural difference between a Feedforward Network and an RNN?

An FNN processes each input independently with no memory. An RNN maintains a hidden state $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$ passed between time steps — output depends on the current input and all previous inputs encoded in $h_{t-1}$.

02Write the vanilla RNN hidden state update formula and explain each term.
$h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h)$    $\hat{y}_t = W_y h_t + b_y$

$h_{t-1}$: previous hidden state (memory); $x_t$: current input; $W_h$: recurrent weights; $W_x$: input weights — the same weights are reused at every time step (weight sharing across time).
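
A minimal NumPy sketch of the forward recurrence (the dimensions and toy input are assumptions):

import numpy as np

d_x, d_h, T = 8, 64, 30
Wx = np.random.randn(d_h, d_x) * 0.01
Wh = np.random.randn(d_h, d_h) * 0.01
b = np.zeros(d_h)
h = np.zeros(d_h)                           # h_0
xs = np.random.randn(T, d_x)                # toy input sequence

for t in range(T):
    h = np.tanh(Wh @ h + Wx @ xs[t] + b)    # same Wx, Wh, b reused at every step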

03What is the vanishing gradient problem in RNNs and why is it worse than in FNNs?

During BPTT, the gradient is multiplied by $W_h^T\cdot\tanh'(\cdot)$ at each step; over T=100+ steps this product → 0 and early steps receive no gradient. Worse than FNNs because RNNs unroll to 100–1000 steps vs only 10–50 layers in typical FNNs.

04What is the exploding gradient problem in RNNs and how is it fixed?

If the largest eigenvalue of $W_h$ > 1, repeated multiplication causes gradients to grow exponentially → NaN weights, training collapses. Fix: gradient clipping — if $\|\nabla\|$ > threshold, scale $\nabla$ down. Course lab: GRAD_CLIP = 5.0.

05What is Backpropagation Through Time (BPTT)?

BPTT unrolls the RNN through time into a deep feedforward graph (one layer per step), then applies standard backprop through all steps computing $\partial\mathcal{L}/\partial W$ where $\mathcal{L}=\sum_t\mathcal{L}_t$. Truncated BPTT backprops only through the last $k$ steps to save memory.

06Name the 4 gates of an LSTM and describe the role of each.

Forget gate $f_t$ (sigmoid): decides what to erase from cell state. Input gate $i_t$ (sigmoid): decides how much new info to write. Cell candidate $\tilde{C}_t$ (tanh): new candidate values. Output gate $o_t$ (sigmoid): what part of cell state to output as $h_t$.

07Write the complete LSTM equations (all 5 equations).
$f_t=\sigma(W_f[h_{t-1},x_t]+b_f)$   $i_t=\sigma(W_i[h_{t-1},x_t]+b_i)$
$\tilde{C}_t=\tanh(W_C[h_{t-1},x_t]+b_C)$   $C_t=f_t\odot C_{t-1}+i_t\odot\tilde{C}_t$
$o_t=\sigma(W_o[h_{t-1},x_t]+b_o)$   $h_t=o_t\odot\tanh(C_t)$

$\odot$: element-wise multiply. $[h_{t-1},x_t]$: concatenation. $C_t$: long-term cell state. $h_t$: short-term working memory / output.

08How does the LSTM cell state $C_t$ solve the vanishing gradient problem?

The cell state is updated additively: $\partial C_t/\partial C_{t-1} = f_t$. When the network needs to remember something, it sets $f_t\approx1$ — gradient flows back unchanged (constant error carousel), enabling learning across hundreds of steps without exponential decay.

09What are the 2 gates of a GRU and how does it differ from LSTM?

GRU has: update gate $z_t$ (merges forget+input) and reset gate $r_t$ (controls how much past to use). No separate cell state — only $h_t$. GRU uses ~25% fewer parameters than LSTM, trains faster, and achieves comparable performance on most tasks.

10What does the forget gate learn to do in language modeling? Give a concrete example.

The forget gate selectively erases cell state when information is no longer relevant. Example: "The cats that were chasing the mouse…" — forget gate keeps "cats" (plural) alive across the relative clause so the model generates the correct plural verb, then erases it after the clause ends.

11What are the four RNN architecture configurations (many-to-one, one-to-many, etc.) with examples?

One-to-many: image captioning. Many-to-one: sentiment classification. Many-to-many (equal length): POS tagging / NER. Many-to-many (seq2seq, different length): machine translation, summarization.

12What is teacher forcing and when is it used?

Teacher forcing feeds the ground-truth previous token as decoder input during training instead of the model's own prediction. Benefits: faster convergence, stable gradients. Risk: "exposure bias" — model sees different distribution at inference. Scheduled sampling gradually reduces teacher forcing.

13What is a Bidirectional RNN and when is it useful?

A BiRNN runs one forward (left→right) and one backward (right→left) RNN; concatenates hidden states at each position. Useful when future context matters (NER, POS tagging, BERT). Cannot be used for real-time generation where the full sequence is not available upfront.

14What is a stacked (deep) RNN and what does each layer learn?

Stacked RNN: output sequence of layer L becomes input to layer L+1. Layer 1 learns low-level patterns (word-level), layer 2 learns phrase-level structures, layer 3 learns sentence-level semantics. Course lab: NUM_LAYERS=2. Typical depth: 2–4 layers.

15What is the encoder-decoder (seq2seq) architecture and what information passes between them?

Encoder reads the full input and compresses it to a context vector $c = h_T^{enc}$ (final hidden state). Decoder initializes with $c$ and generates output token by token. Bottleneck: long inputs lose information compressed into one vector — solved by Attention (direct access to all encoder hidden states).

16What does the hidden size hyperparameter in RNNs control?

Hidden size $d_h$ is the dimensionality of $h_t$ — it controls memory capacity. Too small (16): underfitting. Good (64–512): balanced. Too large (1024+): overfitting risk, slower training. Course lab: HIDDEN_SIZE = 64 for temperature forecasting.

17Why are RNNs fundamentally slower to train than CNNs or Transformers?

RNNs are sequential: $h_t$ requires $h_{t-1}$ — time steps cannot be parallelized. CNNs process all spatial positions in parallel; Transformers compute all positions simultaneously via matrix multiplication. This sequential bottleneck is why Transformers replaced RNNs for long-sequence tasks.

18What is the long-range dependency problem and what is the maximum effective range of a vanilla RNN?

Vanilla RNNs can only reliably remember ~5–10 past time steps due to vanishing gradients. LSTMs extend this to hundreds of steps. Transformers handle thousands of tokens equally regardless of distance via direct attention — O(1) path between any two positions.

19What does the sequence length (SEQ_LEN) hyperparameter control in a time series RNN?

SEQ_LEN is how many past steps the model sees to predict the next step. Course lab: SEQ_LEN=30 (30 days → predict day 31). Too short: misses relevant long-term patterns. Too long: harder to train, more memory, greater vanishing gradient risk.

20Compare LSTM vs GRU vs Vanilla RNN for the task of long-form text generation.

Vanilla RNN: poor long-range, only for very short sequences. GRU: good long-range, ~25% fewer params than LSTM, faster — use when speed matters. LSTM: best long-range, most expressive — use for complex long sequences. In practice, prefer LSTM or GRU over vanilla RNN.

21What is the difference between a regression RNN and a classification RNN at the output layer?

Regression: linear (no) output activation, MSE/MAE loss, continuous output $\hat{y}\in\mathbb{R}$ (e.g., temperature prediction). Classification: softmax (multi-class) or sigmoid (binary) activation, cross-entropy loss, probability output $\hat{y}\in[0,1]$.

22What is the conceptual difference between the cell state $C_t$ and the hidden state $h_t$ in LSTM?

$C_t$ = long-term memory: updated additively, can carry info over hundreds of steps, internal only (never used as output). $h_t$ = working memory: computed through the output gate + tanh($C_t$), used as the output passed to Dense layers and as input to the next step.

23What is the purpose of the LEARNING_RATE=1e-3 and how does it interact with gradient clipping?

LR=1e-3 controls step size; gradient clipping controls maximum gradient norm. Both are needed: clipping prevents catastrophically large updates when gradients explode, while LR prevents overshooting even after clipping has scaled the gradient down.

24What is an embedding layer and why is it placed before the RNN in text models?

An embedding layer maps integer token indices to dense vectors $\mathbf{e}_i \in \mathbb{R}^d$. Placed before RNN because RNNs need dense continuous inputs; raw one-hot vectors are 50k-dimensional, sparse, and semantically meaningless. Embedding weights are learned end-to-end or initialized with pretrained vectors.

25What is the difference between return_sequences=True and return_sequences=False in Keras LSTM?

return_sequences=False: returns only last $h_T$, shape (batch, hidden) — for many-to-one tasks (sentiment). return_sequences=True: returns all $h_t$, shape (batch, seq, hidden) — required for stacked LSTMs and sequence labeling. Intermediate stacked layers must use True.
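
A Keras sketch of a stacked LSTM showing where each setting is required (vocabulary and layer sizes are assumptions):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),
    layers.LSTM(64, return_sequences=True),    # (batch, seq, 64): feeds the next LSTM
    layers.LSTM(64, return_sequences=False),   # (batch, 64): final hidden state only
    layers.Dense(1, activation='sigmoid'),     # many-to-one head, e.g. sentiment
])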

26What is the "constant error carousel" property of LSTM and why is it important?

When $f_t=1$ and $i_t=0$: $C_t=C_{t-1}$ exactly, and the gradient flows back with factor $f_t=1$ — no decay. This is the mechanism that allows LSTM to sustain gradients across hundreds of steps, enabling long-range dependency learning that vanilla RNNs cannot achieve.

27What synthetic time series components are typically used in RNN temperature forecasting labs?

The course lab uses three stacked components: (1) $A\sin(2\pi t/365)$ — annual seasonality, (2) a weekly periodic variation, (3) Gaussian noise. The model (Vanilla RNN, SEQ_LEN=30) must learn to decompose and extrapolate these components to predict day 31.

28Why is tanh used in RNNs (for hidden states) rather than ReLU?

Tanh outputs are bounded in $[-1,1]$, preventing hidden states from growing unboundedly across time steps (ReLU would cause exponential state growth). Tanh is also zero-centered, giving better gradient flow than sigmoid. Empirically, ReLU in vanilla RNNs tends to cause exploding hidden states.

29What is the PATIENCE=10 parameter in the RNN lab's EarlyStopping and why is it higher than in CNN labs?

PATIENCE=10 (vs 5 in CNN labs) because RNN training is noisier: higher gradient variance from sequential dependencies causes val_loss to fluctuate more and plateau temporarily. A higher patience avoids stopping too early during a genuine (but bumpy) improvement phase.

30Summarize: what is the key reason to choose LSTM over vanilla RNN for sequence tasks?

LSTM solves the vanishing gradient problem through its additive cell state update: $\partial C_t/\partial C_{t-1}=f_t$, which can be ≈1. Vanilla RNN multiplies $W_h^T\cdot\tanh'(\cdot)$ at every step → exponential decay. Use vanilla RNN only for sequences ≤10 steps; use LSTM or GRU for all real tasks.

MODULE 04 NLP Preprocessing Pipeline 30 Q
01List the 6 standard steps of the NLP text cleaning pipeline in order.
(1) text.lower()
(2) re.sub(r'<[^>]+>', ' ', text)          # remove HTML tags
(3) re.sub(r'http\S+|www\S+', ' ', text)   # remove URLs
(4) re.sub(r'[^\w\s]', ' ', text)          # remove punctuation
(5) re.sub(r'\d+', ' ', text)              # remove numbers (optional)
(6) re.sub(r'\s+', ' ', text).strip()      # normalize whitespace
02What is tokenization and what are the main types?

Tokenization splits text into discrete units: character (~100 vocab, no OOV), word (50k–500k vocab, poor OOV → UNK), subword/BPE/WordPiece (30k–50k controlled vocab, excellent OOV). BERT uses WordPiece; GPT uses BPE.

03What is the difference between stemming and lemmatization? Give examples of each.

Stemming: heuristic suffix-chopping — fast but may produce non-words ("studies"→"studi"). Lemmatization: dictionary lookup + POS context — always a valid base form ("better"→"good", "running"+"v"→"run"). Stemming is faster; lemmatization is more accurate.

04What are stop words and why should "not", "no", "never" NOT be removed for sentiment analysis?

Stop words are high-frequency function words ("the", "a", "is") with little standalone meaning. Removing negation words ("not", "no", "never") for sentiment analysis completely flips polarity: "not good" → "good" (false positive), "I have no complaints" → "complaints" (false negative).

05What is Part-of-Speech (POS) tagging and what does it enable downstream?

POS tagging assigns grammatical categories (NN, VB, JJ, RB) to each token. Enables: accurate lemmatization (need POS to correctly lemmatize verbs vs nouns), NER, dependency parsing, chunking, and word sense disambiguation.

06What is Named Entity Recognition (NER) and what are the standard entity types?

NER identifies and classifies named entities: PERSON (Barack Obama), ORG (Microsoft), GPE (Casablanca), DATE (May 18, 2026), MONEY ($500). Libraries: spaCy, NLTK maxent_ne_chunker, HuggingFace token classifiers.

07What is the OOV (Out-of-Vocabulary) problem and how do different tokenization strategies handle it?

OOV = token not in training vocabulary. Word-level: maps to [UNK], loses all info. Character: no OOV possible. BPE/WordPiece: breaks rare words into known subword pieces ("unhappy"→["un","##happy"]). FastText: builds OOV vector from character n-gram subwords of known words.

08What is BPE (Byte Pair Encoding) tokenization and how does it build its vocabulary?

BPE starts with a character-level vocabulary, then iteratively merges the most frequent adjacent token pair until the target vocabulary size is reached. Common words stay whole; rare words split into known subword pieces. Used by GPT-2/3/4 and RoBERTa.

09What is dependency parsing and what information does it provide?

Dependency parsing identifies directed grammatical relationships between words (nsubj, dobj, det): "The cat ate the fish" → ate←nsubj←cat, ate→dobj→fish. Applications: information extraction, question answering, coreference resolution, relation extraction.

10What is a Context-Free Grammar (CFG) and write a simple example?

A CFG is a set of recursive rewrite rules defining valid syntactic sentence structures: S→NP VP, NP→Det N, VP→V NP, Det→"the"|"a", N→"cat"|"fish", V→"ate". "The cat ate a fish" parses as S→NP VP→Det N V NP→the cat ate a fish.

11How does NLTK's word_tokenize() differ from Python's str.split()?

"I don't like it, really!" → split(): ["I", "don't", "like", "it,", "really!"] (punctuation attached). word_tokenize(): ["I", "do", "n't", "like", "it", ",", "really", "!"] — splits contractions and separates punctuation using Penn Treebank conventions.

12What is the vocabulary size problem and why is it challenging?

Too small vocab (5k): high OOV rate. Too large vocab (500k): enormous embedding matrix, slow softmax, sparse training per word. Standard: 30k–50k for word-level. Subword methods (BPE/WordPiece) solve this by representing rare words as combinations of frequent subword pieces.

13When should you keep numbers in the text? Give two domain examples where removing them hurts performance.

Keep numbers for: (1) financial text — "Revenue increased 23%" (the percentage is the key signal), (2) medical text — "BMI 32.5, BP 140/90" (numbers carry diagnostic significance). Remove for general sentiment analysis where years and durations are noise.

14What does spaCy's `en_core_web_sm` model provide and what are its capabilities?

en_core_web_sm (~12MB) provides: tokenization, POS tagging (token.pos_), dependency parsing (token.dep_, token.head), NER (doc.ents with entity labels), and lemmatization (token.lemma_). Larger models (md, lg) additionally include word vectors.

15What is the challenge of tokenizing social media text (Twitter, Reddit)?

Standard tokenizers fail on: hashtags (#DeepLearning → splits at #), @mentions, emojis (often UNK or garbled), slang/abbreviations ("lol", "gonna" → OOV), and repeated chars ("sooooo" → rare token). Use TweetTokenizer (NLTK) or BERTweet for social media text.

16What is sentence segmentation and why is it non-trivial?

Sentence segmentation splits documents into sentences before word tokenization. Non-trivial because periods appear in abbreviations ("Dr.", "U.S.A."), decimals ($3.99), and ellipsis — not just as sentence boundaries. NLTK's Punkt tokenizer uses statistical models to distinguish these cases.

17What is the difference between Porter Stemmer and Lancaster Stemmer?

Porter: moderate aggressiveness, readable output ("generously"→"generous") — the most widely used English stemmer. Lancaster: more aggressive, faster, often incomprehensible output ("generously"→"gen"). Use Porter by default; Lancaster only when speed is critical and readability doesn't matter.

18What is coreference resolution and why is it important for information extraction?

Coreference resolution links multiple expressions that refer to the same entity: "Obama", "He", "the president" → all Obama. Without it, QA cannot answer "Who signed the law?" (the answer is in a pronoun), and summarization loses the connection between facts.

19What is word sense disambiguation (WSD) and why is it hard?

WSD determines which meaning of a polysemous word is intended: "bank" = financial institution vs river bank. Hard because many words have dozens of senses (WordNet: "run" has 39), sense boundaries are fuzzy, and rare senses have few training examples. BERT implicitly solves WSD via contextual embeddings.

20What regex pattern removes HTML tags and why is this an important preprocessing step for web-scraped data?
re.sub(r'<[^>]+>', ' ', text) # "<br/>Hello <b>world</b>" → " Hello world"

[^>]+ matches everything between < and >. Replacing with space (not empty string) prevents word merging. Necessary for IMDb/Wikipedia/news data where HTML tags create OOV tokens and break tokenization.

21What NLP libraries are used in the course and what is each one's specialty?

NLTK: classical NLP (stemming, lemmatization, POS, NER chunker). spaCy: industrial-strength NLP (fast POS, NER, dep parsing). scikit-learn: ML pipelines (CountVectorizer, TfidfVectorizer, classifiers). gensim: word embeddings (Word2Vec, GloVe). HuggingFace: pretrained Transformers (BERT, GPT).

22What is morphological analysis and how does it differ from stemming?

Morphological analysis decomposes words into morphemes (smallest meaning units): "unhappiness" → [un- (negation) + happy (root) + -ness (nominalization)]. Unlike stemming (crude heuristic suffix-chopping), it identifies actual functional components. Essential for morphologically rich languages (Arabic, Turkish, Finnish).

23What is the full clean_text() pipeline function from the course?
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)          # remove HTML tags
    text = re.sub(r'http\S+|www\S+', ' ', text)   # remove URLs
    text = re.sub(r'[^\w\s]', ' ', text)          # remove punctuation
    # text = re.sub(r'\d+', ' ', text)            # remove numbers (optional)
    text = re.sub(r'\s+', ' ', text).strip()      # normalize whitespace
    return text
24What is the NLP text classification pipeline end-to-end (from raw text to model output)?
Raw Text → clean_text() → tokenize → [stop words / stem / lemma] → vectorize (BoW / TF-IDF / Embedding) → model (LR / LSTM / BERT) → predict → evaluate (accuracy, F1)
25Why is lowercasing the very first step in the preprocessing pipeline?

Lowercasing first: (1) reduces vocab ("Apple"="apple"="APPLE" → one token), (2) sentence-starting caps are non-semantic, (3) stemmers/lemmatizers expect lowercase. Exception: if NER is needed, do it before lowercasing because capitalization signals named entities.

26What NLTK corpora need to be downloaded for a complete NLP pipeline?
import nltk
nltk.download('punkt')                        # tokenizer models
nltk.download('stopwords')                    # stop word lists
nltk.download('wordnet')                      # lemmatizer dictionary
nltk.download('averaged_perceptron_tagger')   # POS tagger
nltk.download('maxent_ne_chunker')            # NER chunker
nltk.download('words')                        # English word list
27What is the difference between lemmatization using NLTK WordNetLemmatizer with and without POS tag?

Without POS (defaults to noun): lemmatize("running")→"running" (wrong!), lemmatize("better")→"better" (wrong!). With POS: lemmatize("running", pos='v')→"run", lemmatize("better", pos='a')→"good". Always combine with POS tagging for accurate lemmatization.
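
A short NLTK illustration of the contrast (requires nltk.download('wordnet') from the list above):

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize('running'))            # 'running'  (POS defaults to noun)
print(lem.lemmatize('running', pos='v'))   # 'run'
print(lem.lemmatize('better', pos='a'))    # 'good'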

28What is chunking in NLP and how does it differ from full parsing?

Chunking (shallow parsing) groups words into flat phrase chunks (NP, VP) without a full parse tree — fast, partial coverage. Full parsing produces a complete hierarchical tree — slower, complete. Use chunking for NER and information extraction; full parsing for grammar checking or machine translation.

29What is whitespace normalization and what problems does it fix?

re.sub(r'\s+', ' ', text).strip() collapses all whitespace (spaces, \n, \t, multiple spaces) to single spaces and strips leading/trailing. Needed because previous substitutions (replacing tags/URLs with spaces) often create multiple consecutive spaces that confuse tokenizers.

30Why is text preprocessing domain-specific? Compare the pipeline for medical records vs social media.

Medical: keep numbers (doses = critical), careful lowercasing (drug names case-sensitive), expand abbreviations, lemmatize. Social media: handle hashtags/emojis/slang, remove numbers, use specialized tokenizer. There is no universal pipeline — always design for your domain and task.

MODULE 05 Classical Text Representation — BoW · N-grams · TF-IDF 30 Q
01What is the Bag of Words (BoW) model and what information does it explicitly discard?

BoW represents a document as a word count vector over a fixed vocabulary, explicitly discarding: word order ("dog bites man" = "man bites dog"), syntax, context, and semantic relationships ("car" ≠ "automobile").

02Write the BoW vector for "I love cats" with vocabulary ["cats","dogs","hate","love"].

Vocab (sorted): cats=0, dogs=1, hate=2, love=3. "I love cats" contains cats(×1), love(×1); "I" is OOV.

$\text{BoW}(\text{"I love cats"}) = [1, 0, 0, 1]$

Compare: "I hate dogs" → [0,1,1,0] — BoW correctly separates opposite sentiments here.

03What is the IMDb 50k benchmark accuracy of BoW with unigrams?

86.18% accuracy on IMDb 50k binary sentiment. This is the baseline for all classical methods — remarkably strong despite ignoring all word order, because sentiment-bearing words ("excellent", "terrible") carry strong signal on their own.

04Why does BoW fail on "The movie was not good, not bad"? What is this limitation called?

BoW counts "good" and "bad" independently of "not" — it cannot model that "not" negates the adjacent adjective. The neutral/mixed sentiment produces the same or similar vector as a positive or negative review. This is the negation blindness problem.

05What is an N-gram? Define unigram, bigram, and trigram with examples.

An N-gram is a contiguous sequence of N tokens. "The cat sat": unigrams=["The","cat","sat"], bigrams=["The cat","cat sat"], trigrams=["The cat sat"]. N-grams capture local word order, enabling detection of "not good" as a single negative feature.

06Give the exact IMDb benchmark results for unigram, bigram, and trigram N-gram models.

Unigram (1,1): 86.18%. Bigram (1,2): ~88–89%. Trigram (1,2,3): 90.11%. Each level adds local context, improving negation handling and phrase-level sentiment capture (+2–3% per step).

07Write the TF-IDF formula and explain each component.
$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \left[\log\!\left(\dfrac{N}{df(t)}\right) + 1\right]$

TF: how often $t$ appears in document $d$ (local importance). IDF: $\log(N/df(t))$ penalizes words common across many documents ("the" → IDF≈0). High weight = frequent in this doc AND rare across corpus = distinctive.

08What is the IDF of a word that appears in ALL N documents? What does this mean?

IDF = $\log(N/N)+1 = 1$ (with smoothing), or 0 without smoothing. A word appearing in every document ("the") is not distinctive — TF-IDF assigns it near-zero weight regardless of how often it appears in a single document.

09What is the IMDb accuracy of TF-IDF with (1,2)-grams and why does it outperform raw BoW?

90.1% on IMDb 50k — best classical method. Outperforms raw BoW (86.18%) because IDF down-weights uninformative common words ("the", "is" → weight≈0), and bigrams capture negation patterns ("not good", "highly recommend").

10What is the sparsity problem in BoW/TF-IDF and what dimensionality does it produce on IMDb?

IMDb 50k corpus: ~100k unique words → each document is a 100k-dimensional vector where 99.5%+ of entries are zero. Problems: memory (requires sparse matrix format), curse of dimensionality (distances become uninformative), and no semantic generalization ("car"⊥"automobile").

11What is the dimensionality of a BoW vector and what determines it?

Dimensionality = vocabulary size $|V|$, one dimension per unique word (or N-gram). Controlled by max_features — e.g., CountVectorizer(max_features=10000) keeps the 10k most frequent. Bigram vocabulary can reach millions without max_features truncation.

12What is CountVectorizer vs TfidfVectorizer in scikit-learn and what does each output?

CountVectorizer: outputs integer word counts (raw BoW). TfidfVectorizer: outputs float TF-IDF scores — common words get low score (IDF≈0), rare words get high score. Use CountVectorizer for BoW baseline; TfidfVectorizer for classification and retrieval tasks.

13What is the difference between binary BoW and count BoW?

Binary BoW (binary=True): 0 or 1 — does the word appear? Ignores repetition. Count BoW (default): actual occurrence count. For IMDb, count BoW performs better — repetition ("amazing amazing!") genuinely signals strong sentiment intensity.

14When should you STOP at classical methods and not use neural networks? Give 4 conditions.

Stop at classical (TF-IDF + Logistic Regression) when: (1) <10k samples (neural overfits), (2) interpretability required (legal/medical), (3) latency <1ms (neural is too slow), (4) TF-IDF already meets accuracy requirements (the 4% gain from BERT costs 370× more latency).

15What does the `sublinear_tf=True` parameter do in TfidfVectorizer?

Replaces raw TF with $1+\log(\text{TF})$: a word appearing 10× → $1+\log(10)=3.3$ instead of 10. Compresses the frequency range so high-repetition doesn't dominate — a word appearing 100× is not 100× as important as one appearing once. Recommended for most TF-IDF applications.

16What is the ngram_range=(1,2) parameter in scikit-learn vectorizers?

ngram_range=(min_n, max_n) extracts all N-grams in that range. (1,2): unigrams + bigrams — "I love this" → ["I","love","this","I love","love this"]. (1,3): uni+bi+trigrams. Larger ranges add context but exponentially increase vocabulary size.

17What are the 4 shared limitations of BoW, N-gram, and TF-IDF representations?

(1) No semantic meaning: "car"⊥"automobile" (orthogonal vectors). (2) No context/polysemy: "bank" = same vector regardless of meaning. (3) OOV problem: new words at inference → zero vector. (4) High-dimensional sparsity: 100k dims, 99.5% zeros — curse of dimensionality. All solved by neural embeddings.

18What is a document-term matrix and how is it structured?

$\mathbf{X} \in \mathbb{R}^{N_{docs}\times|V|}$: rows = documents, columns = vocabulary terms, entry $X_{ij}$ = count or TF-IDF score. Stored as scipy sparse matrix (csr_matrix) because dense format (50k docs × 100k words × 4 bytes = 20GB) is infeasible to hold in RAM.

19Why do bigrams outperform unigrams on sentiment tasks specifically?

Sentiment is driven by negation and degree adverbs that require exactly 2-word context: "not good" (negative), "highly recommend" (positive), "absolutely terrible" (very negative). Unigrams see each word in isolation — "not" and "good" each have ambiguous sentiment. Bigrams capture the combination directly.

20What does max_features=10000 do in CountVectorizer and what are the trade-offs?

Keeps only the 10,000 most frequent terms, discarding rarer ones. Too small (1k): misses important domain words. No limit: extreme sparsity, slow training, overfitting risk. Standard: 10k–50k for BoW, 50k–100k for TF-IDF + N-grams.

21What is the co-occurrence matrix and how does it relate to word representations?

$C[i,j]$ = number of times word $i$ and word $j$ appear within a context window across the corpus. GloVe is based on factorizing the log co-occurrence matrix — SVD/factorization compresses the $|V|\times|V|$ sparse matrix into dense word vectors. Problem: 10 billion entries for $|V|=100k$.

22What is the polysemy problem in classical text representation? Give a concrete example.

Classical representations assign one feature dimension per word — "bank" (financial) and "bank" (river) share the same counter. A classifier cannot distinguish which meaning is intended. Solved by contextual embeddings (BERT) which produce different vectors for "bank" in each context.

23What is the key reason TF-IDF outperforms raw BoW for document classification?

IDF solves the "common word dominance" problem: "the" appearing 100× dominates "excellent" appearing 3× in raw BoW. TF-IDF assigns "the" weight≈0 (IDF≈0) and "excellent" weight≈24 (high IDF × count) — the classifier focuses on discriminative words rather than function words.

24How do you compute TF (term frequency) for a document? Are there different variants?

Variants: raw count $f_{t,d}$, normalized $f_{t,d}/\sum f$, sublinear $1+\log(f)$ (recommended — best for most TF-IDF), binary (1 if present, 0 otherwise). scikit-learn TfidfVectorizer applies L2 row normalization after computing TF-IDF scores.

25What is the typical machine learning classifier paired with TF-IDF vectors and why?

Logistic Regression or Linear SVM. Why linear: high-dimensional sparse vectors are often linearly separable (a 100k-dim boundary is very expressive), training is computationally efficient, inference is fast, and coefficients directly show which words are most predictive.
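
A sketch of this standard classical baseline as a scikit-learn pipeline (the settings mirror those discussed in this module):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=100000, sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
# clf.fit(train_texts, train_labels)
# clf.score(test_texts, test_labels)        # ≈ 0.90 on IMDb 50k per the benchmark above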

26What is the production use case where TF-IDF excels over neural methods?

Information retrieval / search engines: rank documents by cosine similarity to query, where high-IDF terms (rare but query-relevant) rank highest. BM25 (used by Elasticsearch) is the modern improved TF-IDF variant. TF-IDF also excels for keyword extraction and document deduplication.

27What is the decision ladder — when to escalate from BoW → N-gram → TF-IDF → embeddings?

BoW → N-grams if accuracy insufficient. N-grams → TF-IDF if common words dominate. TF-IDF → Word2Vec/FastText if need >91%, have >50k samples. FastText/W2V → BERT if need >93%, have GPU. Golden rule: start simple — escalate only when requirements aren't met.

28What is the vocabulary explosion problem with N-grams and how is max_features used to control it?

Unigrams ≈50k. Bigrams: potentially 1.25 billion pairs (most rare). Trigrams: astronomically large. max_features=100000 keeps only the most frequent N-grams — rare N-grams are noise, frequent ones capture real patterns. Essential for N>2.

29Why can't BoW/TF-IDF capture semantic similarity between "good" and "excellent"?

"good" and "excellent" each occupy a different one-hot dimension — their cosine similarity = 0 (orthogonal vectors). Despite sharing positive sentiment, the classifier must learn them independently from scratch with no information sharing. This motivates word embeddings where similar words have similar vectors.

30Summarize the performance comparison of all classical methods on IMDb 50k.

BoW unigram: 86.18%. N-gram bigram: ~88–89%. N-gram trigram: 90.11%. TF-IDF (1,2)-gram: 90.1%. TF-IDF is the best classical method overall — the right stopping point before investing in neural approaches for most production NLP tasks.

MODULE 06 Word Embeddings — Static & Contextual 30 Q
01State the distributional hypothesis. Why is it the foundation of word embeddings?

"A word is characterized by the company it keeps" (Firth, 1957). Words in similar contexts have similar meanings. This allows learning meaning purely from co-occurrence statistics — no manual annotation needed, just raw text at scale.

02What problem do dense word embeddings solve compared to sparse BoW vectors?

BoW: 50k–100k dimensional, 99.5% sparse, no semantic relationship between words. Embeddings: 100–300 dimensional, dense, cosine similarity captures semantics — "car" and "automobile" have high similarity (>0.8) instead of being orthogonal vectors.

03What is Word2Vec Skip-gram? Describe the training objective.

Given a center word, predict surrounding context words within a window. Trained to maximize $P(\text{context}|\text{center})$ via negative sampling — updating only k=5–20 random negative words per positive pair instead of the full vocabulary. Works better for rare words (each gets many training signals).

04What is Word2Vec CBOW? How does it differ from Skip-gram?

CBOW: given surrounding context words, predict the center word (opposite of Skip-gram). Faster training (one prediction per window vs many per word). Skip-gram is generally preferred for quality; CBOW for speed on large corpora.

05Write the famous Word2Vec vector arithmetic result and explain what it demonstrates.
$\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$

Demonstrates that embeddings encode linear analogical relationships: the "royalty" offset is consistent regardless of gender. First clear evidence that distributed representations encode structured semantic knowledge. Also: Paris − France + Germany ≈ Berlin.

06Write the cosine similarity formula and explain what it measures.
$\cos(\mathbf{u},\mathbf{v}) = \dfrac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\cdot\|\mathbf{v}\|}$

Measures the angle between vectors regardless of magnitude. Range $[-1, +1]$: +1 = identical direction (same meaning), 0 = orthogonal (no relationship), −1 = opposite (antonyms). Scale-invariant — preferred over Euclidean distance for embeddings.
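
A direct NumPy implementation of the formula:

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # ≈ 0.707 (vectors 45° apart)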

07What are the exact IMDb 50k benchmark results for Word2Vec, GloVe, FastText, and BERT?

BoW baseline: 86.2%, ~1ms. Word2Vec + mean-pool: 85.7%, 12ms. GloVe + mean-pool: 76.8%, 10ms. FastText: 86.0%, 15ms. BERT fine-tuned: 93.9%, 370ms CPU.

08Why does GloVe perform WORSE than BoW on IMDb sentiment (76.8% vs 86.2%)?

Two causes: (1) GloVe's global co-occurrence statistics can't distinguish "not good" from "very good" — both have the same global counts. (2) Mean-pooling over sentence vectors destroys word order, losing negation context. BoW compensates because a trained classifier learns that "not"+"good" together = negative.

09How does GloVe differ from Word2Vec in its training approach?

Word2Vec: predictive neural network, local context window (5–10 words), can stream data. GloVe: count-based matrix factorization of global co-occurrence statistics, requires full corpus upfront. Both produce similar quality embeddings for most tasks.

10How does FastText handle Out-of-Vocabulary (OOV) words? Write the subword decomposition of "acting".

FastText represents each word as the average of its character n-gram embeddings (n=3–6). "acting" → [<ac, act, cti, tin, ing, ng>]. OOV word "unknownterm" = sum of its n-grams which overlap with known words — Word2Vec/GloVe return zero for OOV.
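
A gensim sketch of the OOV behaviour (the toy corpus and the made-up query word are assumptions):

from gensim.models import FastText

sentences = [["the", "acting", "was", "great"], ["terrible", "acting", "and", "plot"]]
model = FastText(sentences, vector_size=100, window=5, min_count=1, min_n=3, max_n=6, epochs=10)
vec = model.wv["actingly"]   # OOV word still gets a vector, built from n-grams shared with "acting"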

11What is the polysemy problem in static embeddings? Give two meanings of one word.

Static embeddings assign one fixed vector per word regardless of context. "bank" (financial institution) and "bank" (river bank) share the same vector — a blended average of both meanings, accurate for neither. Contextual embeddings (BERT) produce different vectors per context.

12What does BERT stand for and what are its two pre-training tasks?

BERT = Bidirectional Encoder Representations from Transformers (Google 2018). Pre-training: (1) MLM — predict 15% masked tokens using both left and right context → bidirectional representations. (2) NSP — predict whether sentence B actually follows sentence A → discourse understanding.

13What is the [CLS] token in BERT and how is it used for classification?

[CLS] is prepended to every input sequence; after 12 Transformer encoder layers its final hidden state = sentence-level representation that has attended to all tokens. This vector (shape: batch×768) is passed to a Dense classification head and the entire model is fine-tuned end-to-end.
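
A minimal HuggingFace sketch of extracting the [CLS] vector in the feature-extraction setting (no gradient, weights untouched):

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("This movie was brilliant!", return_tensors="pt")
with torch.no_grad():
    out = bert(**inputs)
cls_vec = out.last_hidden_state[:, 0, :]    # shape (batch, 768): hidden state at the [CLS] position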

14Compare BERT-Base and BERT-Large: layers, attention heads, hidden dimension, parameters.

BERT-Base: L=12, H=768, A=12, 110M params. BERT-Large: L=24, H=1024, A=16, 340M params. Base is standard for most applications; Large gives ~1–2% better accuracy at ~3× more memory and compute.

15What is BERT's CPU inference latency on IMDb and why does this matter for production?

370ms per sample on CPU. At 100 req/s, BERT-CPU needs 37 CPUs vs FastText needing 1.5. Production SLA <200ms requires GPU (~20–30ms) or DistilBERT (40% faster, 97% accuracy retention). The 4% accuracy gain over TF-IDF costs 370× higher latency.

16What is the "golden rule" for choosing a text representation method?

Start simple. Add complexity only when accuracy requires it. Ladder: TF-IDF → FastText → BERT. The 4% gain (TF-IDF 90% → BERT 94%) comes at 370× higher latency — always evaluate whether that trade-off is justified for your specific application.

17When would you choose FastText over BERT? Give 3 conditions.

Choose FastText when: (1) latency <50ms required (BERT = 370ms CPU), (2) morphologically rich language or OOV-heavy domain (FastText handles OOV via subwords; BERT fragments poorly), (3) no GPU or small dataset (FastText trains in seconds on CPU; BERT needs hours and a GPU).

18What is mean-pooling of word embeddings and why is it insufficient for sentiment?

Mean-pooling: $\mathbf{d} = (1/T)\sum_t \mathbf{e}_t$. Insufficient for sentiment because: (1) destroys word order ("not good" = "good not"), (2) dilutes negation (mixed reviews average to neutral vector), (3) all words weighted equally ("the" = "brilliant"). This is why Word2Vec+mean-pool (85.7%) underperforms BoW (86.2%).

19What is the [SEP] token in BERT and when is it used?

[SEP] marks end of sequence or boundary between two input sentences: [CLS] sentence A [SEP] sentence B [SEP]. Used with segment embeddings (0 for sentence A, 1 for B) so BERT knows which tokens belong to which input — required for NSP, QA, and NLI tasks.

20What does fine-tuning BERT mean in practice? What is updated during fine-tuning?

Fine-tuning updates all 110M pretrained weights using task-specific labeled data, with very small LR (2e-5 to 5e-5) for 2–4 epochs to avoid destroying pretrained representations. A task-specific Dense head is added on top of the [CLS] token embedding.

21What is the typical dimensionality of Word2Vec embeddings and what range is common in practice?

Typically $d=100$–$300$ dimensions. Original Google News Word2Vec uses $d=300$. GloVe 840B also uses d=300. BERT uses d=768 (Base) or d=1024 (Large) — larger because it encodes contextual information, not just lexical meaning.

22What is the 6-review demo result: how many of 6 test reviews does each method classify correctly?

BoW: 5/6 (fails on negation). TF-IDF: 6/6. Word2Vec + mean-pool: 4/6 (fails on mixed reviews). FastText: 4/6. BERT: 6/6. Surprise: TF-IDF matches BERT on this demo; Word2Vec underperforms BoW.

23Describe the three families of text representation in a single comparison.

Classical (BoW/TF-IDF): sparse counts, no semantics, ~1ms, 90.1% IMDb. Static embeddings (W2V/GloVe): dense fixed vectors per word, distributional semantics, one vector per word, 10–15ms, 85.7%. Contextual (BERT): dense per-token per-context vectors (768-dim), deep semantics, 370ms, 93.9%.

24What is the 30-year evolution of NLP text representations?

1990s: BoW/TF-IDF. 2000s: LSA/LDA (topic models). 2013: Word2Vec (neural embeddings, analogy arithmetic). 2014: GloVe (global co-occurrence factorization). 2016: FastText (subword, OOV robustness). 2018: BERT (contextual, bidirectional Transformer). 2020+: GPT-3/T5/LLaMA (generative, massive scale).

25What is negative sampling in Word2Vec and why is it necessary?

Full softmax over $|V|=100k$ vocabulary is O(|V|) per step — computationally prohibitive. Negative sampling instead updates only $k=5$–$20$ randomly sampled "negative" words per positive (center, context) pair, reducing cost to O(k). Makes Word2Vec training practical on large corpora.

26What is gensim's Word2Vec API? Write the training call and how to get a word vector.
from gensim.models import Word2Vec

model = Word2Vec(sentences, vector_size=300, window=5, min_count=2, sg=1, epochs=10)
vec = model.wv['king']                                                 # shape (300,)
model.wv.most_similar('king', topn=5)                                  # nearest neighbours
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])    # analogy: ≈ queen
27What is the strengths and limitations grid for Word2Vec?

Strengths: captures analogies/semantic relationships, dense (300-dim vs 100k), transferable pretrained, fast inference (~12ms). Limitations: one vector per word (no polysemy), no OOV (zero vector), mean-pooling loses order, underperforms BoW on sentiment (85.7% vs 86.2%).

28What is the difference between feature extraction and fine-tuning when using BERT?

Feature extraction: BERT weights frozen — only extract [CLS] embedding, then train a classifier on top. Cheaper, ~90–92% accuracy. Fine-tuning: all weights updated — adapts BERT to task, achieves 93.9%. Fine-tuning is standard; feature extraction only when compute is severely limited.

29Why did Word2Vec achieve 85.7% (less than BoW's 86.2%) on IMDb despite being a "smarter" representation?

Three reasons: (1) mean-pooling loses word order/negation, (2) long IMDb reviews (500+ words) → average vector is a blurry centroid, (3) Word2Vec brings syntactically similar words close together even when they have opposite sentiment ("terrible" and "brilliant" may appear in similar positions).

30When would you use static embeddings (Word2Vec/FastText) vs contextual embeddings (BERT) in production?

FastText: real-time chatbot (<50ms), multilingual/OOV-heavy, no GPU. Sentence-BERT/Word2Vec: semantic search and document similarity. BERT fine-tuned: high-stakes classification with GPU available. DistilBERT/FastText: edge or mobile deployment.

MODULE 07 Transformers & Large Language Models
Attention · BERT · GPT · LoRA · RAG · Prompting