Q&A: 210 Exam Questions — All Modules
30 questions per module · Covers every formula, benchmark, and concept from all 7 modules. Start here the night before the exam.
Convolution Output Size
W_out = floor((W_in - K + 2P) / S) + 1
K = kernel, P = padding, S = stride
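A minimal sketch of the output-size formula in Python (the function name and the ResNet-stem example are illustrative, not from the cards):

```python
import math

def conv_output_size(w_in, k, p, s):
    # W_out = floor((W_in - K + 2P) / S) + 1
    return math.floor((w_in - k + 2 * p) / s) + 1

# e.g. a 224x224 input through a 7x7 kernel, padding 3, stride 2
print(conv_output_size(224, 7, 3, 2))  # → 112
```

With "same" padding (P = (K−1)/2 for odd K) and stride 1, the spatial size is preserved: `conv_output_size(32, 3, 1, 1)` gives 32.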
Gradient Descent Update
w ← w − η · ∂L/∂w
η = learning rate, L = loss function
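The update rule as a one-line sketch, applied to a toy quadratic loss (the example loss and learning rate are illustrative):

```python
def gd_step(w, grad, lr):
    # w ← w − η · ∂L/∂w
    return w - lr * grad

# minimize L(w) = (w - 3)^2, so ∂L/∂w = 2(w - 3)
w = 0.0
for _ in range(100):
    w = gd_step(w, 2 * (w - 3), lr=0.1)
print(round(w, 4))  # → 3.0
```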
LSTM Cell Update
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
f = forget, i = input, C̃ = candidate
LSTM Output
h_t = o_t ⊙ tanh(C_t)
o_t = output gate
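Both LSTM equations in one sketch, using scalar gates for clarity (real LSTMs operate on vectors, and the gate pre-activations would come from learned weights; here they are passed in directly):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def lstm_step(c_prev, f_pre, i_pre, o_pre, cand_pre):
    f = sigmoid(f_pre)          # forget gate
    i = sigmoid(i_pre)          # input gate
    o = sigmoid(o_pre)          # output gate
    c_tilde = math.tanh(cand_pre)   # candidate cell state C̃_t
    c = f * c_prev + i * c_tilde    # C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
    h = o * math.tanh(c)            # h_t = o_t ⊙ tanh(C_t)
    return c, h
```

A fully open forget gate and a fully closed input gate carry the cell state through unchanged: `lstm_step(1.0, 100, -100, 100, 0)` returns `(≈1.0, ≈tanh(1))`.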
TF-IDF
TF-IDF(t,d) = TF(t,d) × (log(N/df(t)) + 1)
N = total docs, df(t) = docs with term t
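A direct translation of the card's formula (this reading, with the +1 inside the IDF factor, matches scikit-learn's non-smoothed IDF; other texts drop the +1):

```python
import math

def tf_idf(tf, n_docs, df):
    # TF-IDF(t,d) = TF(t,d) × (log(N/df(t)) + 1)
    return tf * (math.log(n_docs / df) + 1)

# a term appearing in every document gets IDF = log(1) + 1 = 1
print(tf_idf(2, 10, 10))  # → 2.0
```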
Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
Q/K/V = query, key, value matrices
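A NumPy sketch of single-head attention (no masking or batching, which a real implementation would add):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # QKᵀ/√d_k
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of values
```

When all keys are identical, the softmax is uniform and the output is just the mean of the value vectors, which is a quick sanity check.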
RNN Hidden State
h_t = f(W_h·h_{t-1} + U·x_t + b)
Shared weights at every time step
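One recurrence step as a sketch, with tanh standing in for the generic nonlinearity f (weights are passed in, and the same W_h, U, b would be reused at every time step):

```python
import numpy as np

def rnn_step(h_prev, x, W_h, U, b):
    # h_t = f(W_h·h_{t-1} + U·x_t + b), here with f = tanh
    return np.tanh(W_h @ h_prev + U @ x + b)
```

Processing a sequence is just a loop that feeds each step's output h_t back in as h_{t-1}, always with the same three parameters.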
LoRA Weight Update
W' = W + (α/r)·AB
r = rank (tiny), A∈R^{d×r}, B∈R^{r×k}
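A sketch of the LoRA forward pass using the card's shapes (A∈R^{d×r}, B∈R^{r×k}); note the low-rank path can be applied without ever materializing the full W′:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    # x @ W' where W' = W + (α/r)·A@B, computed as two cheap matmuls
    return x @ W + (alpha / r) * ((x @ A) @ B)
```

At initialization one of the low-rank factors is typically zero, so W′ = W and the adapted model starts out identical to the base model.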