MOD 01 Deep Learning Essentials & ANN
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Deep Learning Essentials
& Artificial Neural Networks

Complete course notes from the professor's lectures — DL fundamentals, ANN architecture, activation functions, backpropagation, and training. Based on the ENSAM 2025/2026 lecture PDFs.

Module: 01 of 07
Sources: DL Essentiels + ANN Lectures
Sections: A.1–A.5 · B.1–B.10
Part A
Deep Learning Essentials
A.1 What is Deep Learning?

Deep Learning is a class of machine learning algorithms that uses multiple layers of artificial neural networks to automatically extract features from data and learn complex patterns. Unlike traditional ML, it does not require manual feature engineering — it learns representations directly from raw data.

Three Core Properties
Compositionality (Hierarchical)

Models learn in layers. Each layer captures progressively more complex representations — edges → shapes → objects. The function is a cascade of non-linear transformations: $f(x) = g_n(g_{n-1}(\ldots g_1(x)))$

End-to-End Learning

The model takes raw input (pixels, audio, text) and directly produces output — no separate feature extraction step. Example: speech recognition → raw audio to text in one model.

Distributed Representations

Concepts are distributed across many neurons — no single neuron codes "cat." Groups of neurons work together, making representations robust and generalizable.

It's "Deep" When…

There is more than one stage of non-linear feature transformation. Two-layer models, SVMs, kernel methods, and decision trees are not deep — they have no feature hierarchy.

Before Training — Three Design Questions
  • Architecture $F(W, X)$: What is the structure of the network? (layers, units, connections)
  • Loss function $L(W, y_i, X_i)$: How do we measure error between prediction and truth?
  • Optimization method: How do we update weights to minimize the loss? (gradient descent)
Neural Network Learning Procedure
  1. Perform an inference on the training set (forward pass)
  2. Calculate the error between predictions and actual labels
  3. Determine the contribution of each neuron to the error using backpropagation
  4. Modify the weights to minimize the error using gradient descent

Gradient descent terminates when: (1) the error is sufficiently small, or (2) the max number of iterations is exceeded.
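A minimal sketch of one such iteration, assuming PyTorch and a toy regression setup (the model, data, and hyperparameters below are illustrative, not from the lecture):

import torch
import torch.nn as nn

# Illustrative setup: tiny 2-layer network, MSE loss, plain SGD
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 4), torch.randn(32, 1)   # dummy training batch

y_hat = model(x)             # 1. inference on the training set (forward pass)
loss = loss_fn(y_hat, y)     # 2. error between predictions and labels
optimizer.zero_grad()
loss.backward()              # 3. each weight's contribution to the error (backprop)
optimizer.step()             # 4. modify the weights to reduce the error (gradient descent)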

A.2 Machine Learning vs Deep Learning
Aspect | Machine Learning | Deep Learning
Feature engineering | Manual — human selects features | Automatic — learned from data
Data requirement | Works with small datasets | Requires large datasets
Compute | Low (CPU sufficient) | High (GPU required)
Interpretability | Higher — explainable | Lower — black box
Performance on complex tasks | Limited | State of the art
Unstructured data (images, audio) | Struggles | Excels
Traditional pipeline | Data → Feature extraction → Model | Raw data → End-to-end model

Traditional ML handles each step (preprocessing, feature extraction, classification) separately. Deep Learning integrates all steps into one end-to-end model trained jointly.

A.3 Types of Deep Architectures
Theoretician's Dilemma — Why Go Deep?

Shallow models (2-layer) can approximate any function — but deep models are more efficient. Deep architectures represent certain function classes (especially in vision) with far fewer parameters per layer. They can encode complex, hierarchical patterns that shallow models cannot without exponentially more neurons.

Three Architecture Families
1. Feed-Forward (FNN/CNN)

Data flows one direction: input → hidden → output, no cycles. Used for classification, regression. Examples: MLP, CNN. Best for tasks where the input is processed once to produce output.

2. Feed-Back (RNN/Autoencoder)

Outputs are fed back into the system to refine computations. Includes loops. Examples: RNN, Stacked Autoencoders. Used in generative models, sequence modeling.

3. Bi-Directional (DBM/Transformer)

Processes data in both forward and backward directions simultaneously. Examples: Deep Boltzmann Machines, BERT, Bidirectional RNNs. Full context available at each position.

Key DL Architecture Zoo

FNN — tabular data, classification
CNN — images, spatial data
RNN/LSTM/GRU — sequences, time-series
Transformer — NLP, vision
GAN — generative modeling
Encoder-Decoder — translation, segmentation

A.4 The Manifold Hypothesis

Manifold Hypothesis: Natural data lives in a low-dimensional non-linear manifold embedded in high-dimensional space. Variables in natural data are mutually dependent, so they do not actually explore the full space.

Example (faces): A face image is 1000×1000 = 1,000,000 pixels (dimensions). But a face has only ~3 Cartesian coordinates, ~3 Euler angles, and humans have <50 facial muscles. So the actual manifold of face images has <56 dimensions — deep learning is learning to navigate this manifold.

Practical Implication — Invariant Feature Learning

The goal is to embed the input non-linearly into a higher-dimensional space where previously non-separable patterns become separable, then pool together semantically similar regions. This is achieved by stacking:

Normalization → Filter bank → Non-linearity → Pooling (repeated for each deep stage)
A.5 Deep Learning — Advantages & Limitations
Advantages

• Automatic feature extraction — no manual engineering
• State-of-the-art on vision, speech, NLP
• Scales with data and compute
• End-to-end: one unified training signal
• Transfer learning: reuse learned representations
• Handles high-dimensional unstructured data

Limitations

• Requires large labeled datasets
• Computationally expensive — needs GPU
• Black box — poor interpretability
• Prone to overfitting on small datasets
• Sensitive to hyperparameters
• Can fail on distribution shift (train vs real-world)

Part B
Artificial Neural Networks (ANN)
B.1 ANN Architecture

An Artificial Neural Network (ANN) is inspired by the human brain. It consists of interconnected neurons organized in layers, transforming input data through learned weights to produce output predictions.

Three Layer Types
  • Input layer: Receives raw data — one neuron per feature (pixel, word index, sensor reading).
  • Hidden layers: Perform learned transformations. Called "hidden" because they are not directly exposed to input or output. The term "deep" refers to having many hidden layers.
  • Output layer: Produces the final prediction — one neuron per class (multi-class) or one neuron (regression/binary).
Weights — The "Volume Knob"

A weight $w$ determines how much a specific input signal matters. High weight = high impact. Example: deciding whether to go to a concert — if you hate rain, the "weather" input gets a high weight ($w=10$); if money is not an issue, "ticket price" gets a low weight ($w=1$).

Bias — The "Baseline Tendency"

Bias $b$ is an offset added to the neuron's output. It shifts the decision boundary — the neuron can fire even when all inputs are zero, or stay quiet even if inputs are high. Think of bias as your "mood" before seeing any data.

Neuron computation $$z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b = \mathbf{W}\cdot\mathbf{x} + b$$ $$a = f(z) \quad \leftarrow \text{activation function}$$
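A minimal NumPy sketch of this single-neuron computation (the sigmoid used for $f$ is only one possible choice; all values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.2, 0.5])    # inputs (e.g. weather, ticket price, distance)
w = np.array([10.0, 1.0, 2.0])   # weights — how much each input matters
b = -2.0                         # bias — baseline tendency to fire

z = np.dot(w, x) + b             # weighted sum plus bias
a = sigmoid(z)                   # activation f(z)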
Non-Linearity — Why It's Essential

Without activation functions, stacking layers collapses to a single linear equation — no matter how many layers you add. Non-linear activations allow the network to "fold, twist, warp" the input space, creating complex decision boundaries that no straight line could capture.

Linear: "The more I work, the more I earn" (straight line). Non-linear: "Water makes plants grow… until too much kills them" (curve). Real-world problems are curves.

B.2 Forward Pass — Computing Predictions

Data flows left to right through the network. At each layer, two operations occur:

  1. Linear transformation: Weighted sum of inputs plus bias
  2. Non-linear activation: Apply activation function to introduce non-linearity
Layer $l$ — forward computation $$\mathbf{Z}^{(l)} = \mathbf{W}^{(l)} \cdot \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$ $$\mathbf{a}^{(l)} = f\!\left(\mathbf{Z}^{(l)}\right)$$

$\mathbf{a}^{(l)}$: activations at layer $l$ · $\mathbf{W}^{(l)}$: weight matrix · $\mathbf{b}^{(l)}$: bias vector · $f$: activation function

Parameter Count

For a layer mapping $n_{in}$ inputs to $n_{out}$ outputs:

Parameters per layer $$\text{Parameters} = (n_{in} \times n_{out}) + n_{out}$$

Example: Network 784 → 256 → 128 → 10

Layer 1 (784→256): 784 × 256 + 256 = 200,960
Layer 2 (256→128): 256 × 128 + 128 = 32,896
Layer 3 (128→10): 128 × 10 + 10 = 1,290
Total: 235,146 parameters
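The same arithmetic in a short Python snippet, so the count can be checked for any architecture:

sizes = [784, 256, 128, 10]                     # layer widths from the example
total = 0
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params = n_in * n_out + n_out               # weights + biases
    total += params
    print(f"{n_in} -> {n_out}: {params:,} parameters")
print(f"Total: {total:,} parameters")           # 235,146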
B.3 Activation Functions

An activation function determines whether a neuron should "fire" by outputting a specific value based on its input. They introduce non-linearity — the key ingredient that makes deep networks powerful.

1. Sigmoid
Sigmoid (Logistic) $$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad \text{Range: } (0, 1)$$
  • Use: Binary classification output (probability interpretation)
  • Advantage: Smooth gradient, outputs interpretable as probabilities
  • Disadvantage (Vanishing gradient): For large $|x|$, derivative $\approx 0$ → early layers learn very slowly
  • Not zero-centered: Outputs always positive → optimization harder
2. Tanh (Hyperbolic Tangent)
Tanh $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \qquad \text{Range: } (-1, 1)$$
  • Use: Hidden layers in older architectures, RNNs
  • Advantage: Zero-centered → gradient descent easier than sigmoid
  • Disadvantage: Still suffers from vanishing gradients for large inputs
3. ReLU — Default for Hidden Layers
ReLU (Rectified Linear Unit) $$f(x) = \max(0, x) \qquad \text{Range: } [0, +\infty)$$
  • Use: Default for hidden layers in CNNs, MLPs
  • Advantage: Computationally cheap, no saturation for $x>0$, combats vanishing gradient
  • Sparse activation: Only active neurons fire → efficient representations
  • Disadvantage (Dying ReLU): If a neuron always receives negative input, it outputs 0 permanently and never learns

Example: $x = 2.9 \Rightarrow \text{ReLU}(2.9) = 2.9$. $x = -1.5 \Rightarrow \text{ReLU}(-1.5) = 0$

4. Leaky ReLU
Leaky ReLU $$f(x) = \max(0.01x,\; x)$$
  • Use: When standard ReLU leads to dead neurons
  • Advantage: Allows a small gradient for $x < 0$ → prevents dying neurons
  • Disadvantage: Requires tuning of the negative slope (default 0.01)
5. PReLU (Parametric ReLU)
PReLU $$f(x) = \max(\alpha x,\; x) \quad \alpha \text{ is learned during training}$$
  • Use: Advanced architectures requiring dynamic adaptation
  • Advantage: Slope $\alpha$ adapts via backprop → better performance
6. Softmax — Multi-class Output
Softmax $$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \qquad \text{Outputs sum to 1}$$
  • Use: Final layer for multi-class classification
  • Advantage: Produces probability distribution over classes
  • Disadvantage: Prone to saturated gradients when probabilities are near 0 or 1
7. ELU & 8. Swish
ELU (Exponential Linear Unit) $$f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$ Swish (Google) $$f(x) = x \cdot \sigma(x)$$
  • ELU: Faster convergence, reduces dead neurons, maintains positive mean activations
  • Swish: Used in EfficientNet. Smooth, non-monotonic. Sometimes outperforms ReLU in deep networks
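For reference, a minimal NumPy sketch of the activations above, each applied element-wise to an array z (the shift by max(z) in softmax is a standard numerical-stability trick, not from the lecture):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def swish(z):
    return z * sigmoid(z)

def softmax(z):
    e = np.exp(z - np.max(z))     # shift for numerical stability
    return e / e.sum()

# tanh is available directly as np.tanh(z)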
Comparison Table
Function | Formula | Range | Use Case | Key Issue
Sigmoid | $1/(1+e^{-x})$ | (0, 1) | Binary output | Vanishing gradient
Tanh | $(e^x-e^{-x})/(e^x+e^{-x})$ | (-1, 1) | Hidden (old), RNN | Vanishing gradient
ReLU | $\max(0, x)$ | [0, ∞) | Default hidden | Dead neurons
Leaky ReLU | $\max(0.01x, x)$ | (-∞, ∞) | Fix dying ReLU | Slope tuning
PReLU | $\max(\alpha x, x)$ | (-∞, ∞) | Adaptive slope | Extra parameter
Softmax | $e^{z_i}/\sum_j e^{z_j}$ | (0, 1), sums to 1 | Multi-class output | Saturated gradients
ELU | $x$ or $\alpha(e^x-1)$ | (-α, ∞) | Faster convergence | Compute cost
Swish | $x \cdot \sigma(x)$ | (-∞, ∞) | EfficientNet | Compute cost
B.4 Loss Functions

A loss function measures how wrong the model's predictions are. The entire goal of training is to minimize the loss.

Mean Squared Error (MSE) — Regression
MSE $$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
  • Penalizes large errors heavily (squared)
  • Sensitive to outliers
Binary Cross-Entropy — Binary Classification
Binary Cross-Entropy $$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right]$$
  • Used with Sigmoid output
  • Punishes confident wrong predictions more
Categorical Cross-Entropy — Multi-Class Classification
Categorical Cross-Entropy $$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
  • $y_{i,c} = 1$ if sample $i$ belongs to class $c$, else 0 (one-hot encoding)
  • Used with Softmax output layer
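A minimal NumPy sketch of the three losses (clipping the predictions away from 0 and 1 is an added numerical-stability detail):

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)              # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)              # y_onehot: one-hot labels
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))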
B.5 Backpropagation

Backpropagation is the process of computing how much each weight contributed to the loss and updating them to reduce error. It works by propagating the error signal backward through the network, layer by layer, using the chain rule of calculus.

Chain Rule — Gradient computation for weight $W^{(l)}$ $$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}}$$

The gradient $\partial \mathcal{L}/\partial w$ tells you: "if I increase this weight slightly, does the loss go up or down, and by how much?" Each factor in the chain is simple to compute; chaining them gives the full gradient.

Forward vs Backward
Forward Pass (Inference)

Data flows input → output. At each layer: linear transform then activation. Early layers: edges/textures. Mid layers: shapes. Final: full object recognition.

Backward Pass (Learning)

Error flows output → input. Gradient is computed at each layer using chain rule. Weights are adjusted in the direction that reduces loss. This is the "credit assignment" problem.

Key insight: The gradient tells us in which direction to adjust each weight to reduce loss. Backprop makes this computation tractable for deep networks.
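A tiny worked sketch of the chain rule, assuming a single sigmoid neuron trained with a squared error on one example (all numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.0, 2.0]), 1.0     # one training example and its label
w, b = np.array([0.1, -0.2]), 0.0    # current parameters

# Forward pass
z = np.dot(w, x) + b
a = sigmoid(z)
loss = (a - y) ** 2

# Backward pass: dL/dw = (dL/da) * (da/dz) * (dz/dw)
dL_da = 2 * (a - y)
da_dz = a * (1 - a)                  # derivative of the sigmoid
dz_dw = x
grad_w = dL_da * da_dz * dz_dw       # gradient w.r.t. each weight
grad_b = dL_da * da_dz               # gradient w.r.t. the bias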
B.6 Gradient Descent

Gradient descent adjusts weights in the direction opposite to the gradient to minimize the loss. Think of it as rolling a ball down a hill — the ball is the parameters, the hill is the loss landscape, the goal is the valley (minimum loss).

Weight update rule $$W^{(l)}_{\text{new}} = W^{(l)}_{\text{old}} - \eta \cdot \frac{\partial \mathcal{L}}{\partial W^{(l)}}$$

$\eta$ = learning rate (most important hyperparameter). Too large → overshoots minimum. Too small → training crawls.

Three Variants
Variant | Data per update | Pros | Cons
Batch GD | Entire dataset | Stable, accurate gradients | Very slow, memory intensive
Stochastic GD (SGD) | 1 sample | Fast updates, can escape local minima | Noisy updates, high variance
Mini-batch GD | 32–256 samples | Balance of speed and stability | Batch size is a hyperparameter

Mini-batch GD is the standard in deep learning. Common batch sizes: 32, 64, 128.

Cost Function
Cost function (e.g., MSE for regression) $$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left(h(x_i) - y_i\right)^2$$

Gradient descent finds $w, b$ that minimize $J(w,b)$.
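A minimal sketch of gradient descent on this cost, assuming a linear hypothesis $h(x) = wx + b$ and a tiny 1-D dataset (data and learning rate are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])           # roughly y = 2x

w, b, eta = 0.0, 0.0, 0.01                   # parameters and learning rate
for step in range(1000):
    y_hat = w * x + b
    grad_w = np.mean(2 * (y_hat - y) * x)    # dJ/dw
    grad_b = np.mean(2 * (y_hat - y))        # dJ/db
    w -= eta * grad_w                        # step opposite to the gradient
    b -= eta * grad_b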

B.7 Optimizers

Modern optimizers improve on plain gradient descent by adding momentum and adaptive learning rates to converge faster and more reliably.

Optimizer | Key idea | Formula
SGD | Plain gradient descent | $W = W - \eta \nabla L$
Momentum | Accumulates velocity in the gradient direction | $v = \beta v - \eta \nabla L$; $W = W + v$
RMSprop | Adaptive LR per parameter — divide by a running average of squared gradients | $W = W - \frac{\eta}{\sqrt{v+\varepsilon}}\nabla L$
Adam ★ | Momentum + RMSprop combined. Best general-purpose optimizer. | See below
Adam — Adaptive Moment Estimation
Adam update equations $$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L \quad \leftarrow \text{1st moment (mean)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla L)^2 \quad \leftarrow \text{2nd moment (variance)}$$ $$W = W - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

Defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\varepsilon=10^{-8}$. $\hat{m}, \hat{v}$ are bias-corrected estimates.

Adam combines momentum (uses past gradient directions) with adaptive learning rates (different LR per parameter based on gradient history). Works well out of the box for most problems.
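A minimal NumPy sketch of one Adam update for a parameter array (names follow the equations above; t is the 1-based step index needed for bias correction):

import numpy as np

def adam_step(W, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # 1st moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2        # 2nd moment (variance)
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return W, m, v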

B.8 Regularization

Overfitting: Model memorizes training data and fails on new data. Signs: low training loss + high validation loss + large train/val accuracy gap.

Key Techniques
Technique | Mechanism | Effect
L2 (Weight Decay) | Adds $\lambda \|\mathbf{W}\|^2$ to the loss | Penalizes large weights, drives them toward small values
L1 | Adds $\lambda \|\mathbf{W}\|_1$ to the loss | Promotes sparsity — many weights exactly zero
Dropout | Randomly zeros a fraction $p$ of neurons during training | Forces redundant representations, reduces co-adaptation
Batch Normalization | Normalizes layer inputs per mini-batch | Stabilizes training, acts as a regularizer, allows higher LR
Early Stopping | Stop when val loss stops improving | Prevents over-training
Data Augmentation | Generate varied training samples (flip, crop, rotate) | Increases effective dataset size
Dropout

During training, each neuron is set to 0 with probability $p$. At test time, all neurons are active but outputs are scaled by $(1-p)$. Standard values: $p = 0.2\text{–}0.5$.
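A minimal NumPy sketch of this behaviour — masking during training, scaling by $(1-p)$ at test time (most modern frameworks use the equivalent "inverted dropout", which rescales during training instead):

import numpy as np

def dropout(a, p=0.5, training=True):
    if training:
        mask = np.random.rand(*a.shape) >= p   # keep each neuron with prob 1 - p
        return a * mask                        # dropped neurons output 0
    return a * (1 - p)                         # test time: scale activations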

Batch Normalization
Batch Norm equations $$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$$ $$y = \gamma \hat{x} + \beta$$

$\mu_B, \sigma_B^2$: batch mean and variance · $\gamma, \beta$: learnable scale and shift · Applied after linear layer, before activation
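A minimal NumPy sketch of the forward computation on a mini-batch (rows are samples, columns are features; $\gamma$ and $\beta$ would be learned by backprop):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # learnable scale and shift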

B.9 Training Pipeline & Hyperparameters
Standard Training Loop
1. Initialize weights (Xavier or He initialization)
2. For each epoch:
   a. Forward pass → compute predictions ŷ
   b. Compute loss L(y, ŷ)
   c. Backward pass → compute gradients ∂L/∂W (backprop)
   d. Update weights: W = W − η · ∂L/∂W (optimizer.step())
3. Evaluate on the validation set after each epoch
4. Apply early stopping if val_loss stops improving
5. Save a model checkpoint at the best val performance
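The same loop sketched in PyTorch, assuming a toy dataset and model (every name below is a placeholder; only the structure mirrors the steps above):

import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model, just to make the sketch self-contained
X, Y = torch.randn(512, 20), torch.randn(512, 1)
train_loader = DataLoader(TensorDataset(X[:400], Y[:400]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[400:], Y[400:]), batch_size=32)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(200):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)          # forward pass + loss
        loss.backward()                      # backward pass (backprop)
        optimizer.step()                     # weight update
    model.eval()                             # validation after each epoch
    with torch.no_grad():
        val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
    if val_loss < best_val:                  # checkpoint at best val loss
        best_val, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # early stopping
            break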
Weight Initialization

Poor initialization → slow learning or divergence:

Xavier (for Sigmoid/Tanh) $$W \sim \mathcal{N}\!\left(0,\; \sqrt{\frac{2}{n_{in}+n_{out}}}\right)$$ He (for ReLU) $$W \sim \mathcal{N}\!\left(0,\; \sqrt{\frac{2}{n_{in}}}\right)$$
  • Too small → vanishing gradients
  • Too large → exploding gradients
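A minimal NumPy sketch of both schemes, reading the second argument of $\mathcal{N}$ above as the standard deviation:

import numpy as np

def xavier_init(n_in, n_out):
    std = np.sqrt(2.0 / (n_in + n_out))      # suited to sigmoid/tanh layers
    return np.random.randn(n_out, n_in) * std

def he_init(n_in, n_out):
    std = np.sqrt(2.0 / n_in)                # suited to ReLU layers
    return np.random.randn(n_out, n_in) * std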
Key Hyperparameters
Hyperparameter | Typical values | Effect
Learning rate $\eta$ | $10^{-4}$ to $10^{-2}$ | Too high = diverges, too low = slow
Batch size | 32, 64, 128 | Larger = smoother but slower convergence
Hidden units | 64, 128, 256, 512 | More = more capacity, more overfitting risk
Dropout rate | 0.2–0.5 | Higher = more regularization
Epochs | 10–200 | Use early stopping to control
Gradient Checking (Debug Tool)

When implementing backpropagation, verify correctness by comparing the analytical gradient with a numerical approximation:

Numerical Gradient Approximation $$\frac{\partial L}{\partial w} \approx \frac{f(w + \varepsilon) - f(w - \varepsilon)}{2\varepsilon} \qquad \varepsilon \approx 10^{-5}$$

Expected relative error between analytical and numerical gradient: < $10^{-7}$. If larger, suspect a bug in backprop.
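A minimal sketch of the check for one scalar weight (f is any function computing the loss as a function of that weight; the lambda example at the end is purely illustrative):

def gradient_check(f, w, analytical_grad, eps=1e-5):
    numerical = (f(w + eps) - f(w - eps)) / (2 * eps)   # centered difference
    denom = max(abs(numerical), abs(analytical_grad), 1e-12)
    return abs(numerical - analytical_grad) / denom     # relative error, expect < 1e-7

# Example: f(w) = w**2 has analytical gradient 2w = 6 at w = 3
print(gradient_check(lambda w: w ** 2, 3.0, 6.0))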

Training Curves Interpretation
Pattern | Diagnosis | Action
train_loss ↓, val_loss ↓ | Good training | Continue
train_loss ↓, val_loss ↑ | Overfitting | Add dropout, reduce capacity, early stop
Both losses high | Underfitting | Increase capacity, train longer
Large train–val gap | Overfitting | More regularization
Small train–val gap | Good generalization | Model is reliable
B.10 Evaluation Metrics
Confusion Matrix (Binary)
                | Predicted Positive | Predicted Negative
Actual Positive | TP (True Pos)      | FN (False Neg)
Actual Negative | FP (False Pos)     | TN (True Neg)
Core metrics $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$ $$\text{Precision} = \frac{TP}{TP + FP} \quad \leftarrow \text{of all predicted positives, how many are real?}$$ $$\text{Recall} = \frac{TP}{TP + FN} \quad \leftarrow \text{of all actual positives, how many did we catch?}$$ $$F_1 = \frac{2 \times P \times R}{P + R} \quad \leftarrow \text{harmonic mean of Precision and Recall}$$
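A short sketch computing the four metrics from raw confusion-matrix counts (the counts are illustrative):

tp, fp, fn, tn = 80, 10, 20, 90      # illustrative counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)           # of predicted positives, how many are real
recall    = tp / (tp + fn)           # of actual positives, how many were caught
f1        = 2 * precision * recall / (precision + recall)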
When to Use Which Metric
Metric | Use when | Example
Accuracy | Classes are balanced | MNIST digit recognition
Recall | False Negatives are costly | Disease detection — missing a sick patient is dangerous
Precision | False Positives are costly | Spam filter — sending real emails to spam is bad
F1-Score | Both FP and FN matter, or class imbalance | Fraud detection
AUC-ROC | Overall discrimination ability (threshold-independent) | Medical AI ranking