Deep Learning Essentials
& Artificial Neural Networks
Complete course notes from professor lectures — DL fundamentals, ANN architecture, activation functions, backpropagation, and training. Based on ENSAM 2025/2026 lecture PDFs.
Deep Learning is a class of machine learning algorithms that uses multiple layers of artificial neural networks to automatically extract features from data and learn complex patterns. Unlike traditional ML, it does not require manual feature engineering — it learns representations directly from raw data.
Models learn in layers. Each layer captures progressively more complex representations — edges → shapes → objects. The function is a cascade of non-linear transformations: $f(x) = g_n(g_{n-1}(\ldots g_1(x)))$
The model takes raw input (pixels, audio, text) and directly produces output — no separate feature extraction step. Example: speech recognition → raw audio to text in one model.
Concepts are distributed across many neurons — no single neuron codes "cat." Groups of neurons work together, making representations robust and generalizable.
"Deep" means more than one stage of non-linear feature transformation. Two-layer models, SVMs, kernel methods, and decision trees are not deep: they have no feature hierarchy.
- Architecture $F(W, X)$: What is the structure of the network? (layers, units, connections)
- Loss function $L(W, y_i, X_i)$: How do we measure error between prediction and truth?
- Optimization method: How do we update weights to minimize the loss? (gradient descent)
- Perform an inference on the training set (forward pass)
- Calculate the error between predictions and actual labels
- Determine the contribution of each neuron to the error using backpropagation
- Modify the weights to minimize the error using gradient descent
Gradient descent terminates when: (1) the error is sufficiently small, or (2) the max number of iterations is exceeded.
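As a concrete illustration, here is a minimal NumPy sketch of this loop for a single linear layer on made-up data (so the "backpropagation" step reduces to one gradient formula). The variable names, toy data, learning rate, and stopping threshold are all assumptions, not from the lecture:

```python
import numpy as np

# Toy regression data: 100 samples, 3 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3           # hidden ground-truth rule

W = rng.normal(scale=0.1, size=3)                   # weights
b = 0.0                                             # bias
lr = 0.1                                            # learning rate (eta)

for epoch in range(500):
    y_hat = X @ W + b                               # 1. forward pass (inference)
    error = y_hat - y
    loss = np.mean(error ** 2)                      # 2. measure the error (MSE)
    grad_W = 2 * X.T @ error / len(X)               # 3. contribution of each weight
    grad_b = 2 * error.mean()
    W -= lr * grad_W                                # 4. update weights (gradient descent)
    b -= lr * grad_b
    if loss < 1e-6:                                 # stop when the error is small enough
        break
```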
| Aspect | Machine Learning | Deep Learning |
|---|---|---|
| Feature engineering | Manual — human selects features | Automatic — learned from data |
| Data requirement | Works with small datasets | Requires large datasets |
| Compute | Low (CPU sufficient) | High (GPU required) |
| Interpretability | Higher — explainable | Lower — black box |
| Performance on complex tasks | Limited | State of the art |
| Unstructured data (images, audio) | Struggles | Excels |
| Pipeline | Data → Feature extraction → Model | Raw data → End-to-end model |
Traditional ML handles each step (preprocessing, feature extraction, classification) separately. Deep Learning integrates all steps into one end-to-end model trained jointly.
Shallow models (2-layer) can approximate any function — but deep models are more efficient. Deep architectures represent certain function classes (especially in vision) with far fewer parameters per layer. They can encode complex, hierarchical patterns that shallow models cannot without exponentially more neurons.
Data flows one direction: input → hidden → output, no cycles. Used for classification, regression. Examples: MLP, CNN. Best for tasks where the input is processed once to produce output.
Outputs are fed back into the system to refine computations. Includes loops. Examples: RNN, Stacked Autoencoders. Used in generative models, sequence modeling.
Processes data in both forward and backward directions simultaneously. Examples: Deep Boltzmann Machines, BERT, Bidirectional RNNs. Full context available at each position.
FNN — tabular data, classification
CNN — images, spatial data
RNN/LSTM/GRU — sequences, time-series
Transformer — NLP, vision
GAN — generative modeling
Encoder-Decoder — translation, segmentation
Manifold Hypothesis: Natural data lives in a low-dimensional non-linear manifold embedded in high-dimensional space. Variables in natural data are mutually dependent, so they do not actually explore the full space.
Example (faces): A face image is 1000×1000 = 1,000,000 pixels (dimensions). But a face has only ~3 Cartesian coordinates, ~3 Euler angles, and humans have <50 facial muscles. So the actual manifold of face images has <56 dimensions — deep learning is learning to navigate this manifold.
The goal is to embed the input non-linearly into a higher-dimensional space where previously non-separable patterns become separable, then pool together semantically similar regions. This is achieved by stacking layers that alternate a non-linear embedding step with a pooling/aggregation step.
• Automatic feature extraction — no manual engineering
• State-of-the-art on vision, speech, NLP
• Scales with data and compute
• End-to-end: one unified training signal
• Transfer learning: reuse learned representations
• Handles high-dimensional unstructured data
• Requires large labeled datasets
• Computationally expensive — needs GPU
• Black box — poor interpretability
• Prone to overfitting on small datasets
• Sensitive to hyperparameters
• Can fail on distribution shift (train vs real-world)
An Artificial Neural Network (ANN) is inspired by the human brain. It consists of interconnected neurons organized in layers, transforming input data through learned weights to produce output predictions.
- Input layer: Receives raw data — one neuron per feature (pixel, word index, sensor reading).
- Hidden layers: Perform learned transformations. Called "hidden" because they are not directly exposed to input or output. The term "deep" refers to having many hidden layers.
- Output layer: Produces the final prediction — one neuron per class (multi-class) or one neuron (regression/binary).
A weight $w$ determines how much a specific input signal matters. High weight = high impact. Example: deciding whether to go to a concert — if you hate rain, the "weather" input gets a high weight ($w=10$); if money is not an issue, "ticket price" gets a low weight ($w=1$).
Bias $b$ is an offset added to the neuron's output. It shifts the decision boundary — the neuron can fire even when all inputs are zero, or stay quiet even if inputs are high. Think of bias as your "mood" before seeing any data.
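To make the roles of $w$ and $b$ concrete, here is a one-neuron sketch of the concert decision above, assuming NumPy. The exact numbers (weight 10 for weather, 1 for price, the bias value, and a simple "fire or not" threshold) are illustrative assumptions:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a simple fire/no-fire threshold."""
    z = np.dot(weights, inputs) + bias
    return 1 if z > 0 else 0              # 1 = go to the concert, 0 = stay home

# inputs: [good_weather, cheap_ticket], each 0 or 1
weights = np.array([10.0, 1.0])           # weather matters a lot, ticket price barely
bias = -5.0                               # a grumpy "mood": needs convincing before firing

print(neuron(np.array([1, 1]), weights, bias))   # good weather -> 1 (go)
print(neuron(np.array([0, 1]), weights, bias))   # bad weather  -> 0 (stay home)
```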
Without activation functions, stacking layers collapses to a single linear equation — no matter how many layers you add. Non-linear activations allow the network to "fold, twist, warp" the input space, creating complex decision boundaries that no straight line could capture.
Linear: "The more I work, the more I earn" (straight line). Non-linear: "Water makes plants grow… until too much kills them" (curve). Real-world problems are curves.
Data flows left to right through the network. At each layer, two operations occur:
- Linear transformation: Weighted sum of inputs plus bias
- Non-linear activation: Apply activation function to introduce non-linearity
$\mathbf{a}^{(l)} = f\big(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\big)$, where $\mathbf{a}^{(l)}$: activations at layer $l$ · $\mathbf{W}^{(l)}$: weight matrix · $\mathbf{b}^{(l)}$: bias vector · $f$: activation function
For a layer mapping $n_{in}$ inputs to $n_{out}$ outputs: $\mathbf{W} \in \mathbb{R}^{n_{out} \times n_{in}}$ and $\mathbf{b} \in \mathbb{R}^{n_{out}}$, giving $n_{out}(n_{in}+1)$ trainable parameters.
Example: Network 784 → 256 → 128 → 10 has $(784{+}1)\cdot 256 + (256{+}1)\cdot 128 + (128{+}1)\cdot 10 = 235{,}146$ parameters.
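A sketch of the forward pass for this 784 → 256 → 128 → 10 network, assuming NumPy, random weights, and ReLU hidden layers with a Softmax output (a common but assumed choice here):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]                              # layer widths from the example

# Random parameters, just to make the shapes concrete
Ws = [rng.normal(scale=0.01, size=(n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
bs = [np.zeros(n_out) for n_out in sizes[1:]]

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())                              # subtract max for numerical stability
    return e / e.sum()

def forward(x):
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = relu(W @ a + b)                              # a^(l) = f(W^(l) a^(l-1) + b^(l))
    return softmax(Ws[-1] @ a + bs[-1])                  # output layer: class probabilities

x = rng.random(784)                                      # e.g. a flattened 28x28 image
probs = forward(x)
print(probs.shape, probs.sum())                          # (10,) ~1.0
print(sum(W.size + b.size for W, b in zip(Ws, bs)))      # 235146 parameters
```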
An activation function determines whether a neuron should "fire" by outputting a specific value based on its input. They introduce non-linearity — the key ingredient that makes deep networks powerful.
Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$
- Use: Binary classification output (probability interpretation)
- Advantage: Smooth gradient, outputs interpretable as probabilities
- Disadvantage (Vanishing gradient): For large $|x|$, derivative $\approx 0$ → early layers learn very slowly
- Not zero-centered: Outputs always positive → optimization harder
Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- Use: Hidden layers in older architectures, RNNs
- Advantage: Zero-centered → gradient descent easier than sigmoid
- Disadvantage: Still suffers from vanishing gradients for large inputs
ReLU (Rectified Linear Unit): $\text{ReLU}(x) = \max(0, x)$
- Use: Default for hidden layers in CNNs, MLPs
- Advantage: Computationally cheap, no saturation for $x>0$, combats vanishing gradient
- Sparse activation: Only active neurons fire → efficient representations
- Disadvantage (Dying ReLU): If a neuron always receives negative input, it outputs 0 permanently and never learns
Example: $x = 2.9 \Rightarrow \text{ReLU}(2.9) = 2.9$. $x = -1.5 \Rightarrow \text{ReLU}(-1.5) = 0$
Leaky ReLU: $\max(0.01x, x)$
- Use: When standard ReLU leads to dead neurons
- Advantage: Allows a small gradient for $x < 0$ → prevents dying neurons
- Disadvantage: Requires tuning of the negative slope (default 0.01)
PReLU (Parametric ReLU): $\max(\alpha x, x)$, with $\alpha$ learned
- Use: Advanced architectures requiring dynamic adaptation
- Advantage: Slope $\alpha$ adapts via backprop → better performance
Softmax: $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$
- Use: Final layer for multi-class classification
- Advantage: Produces probability distribution over classes
- Disadvantage: Prone to saturated gradients when probabilities are near 0 or 1
- ELU: Faster convergence, reduces dead neurons, maintains positive mean activations
- Swish: Used in EfficientNet. Smooth, non-monotonic. Sometimes outperforms ReLU in deep networks
| Function | Formula | Range | Use Case | Key Issue |
|---|---|---|---|---|
| Sigmoid | $1/(1+e^{-x})$ | (0,1) | Binary output | Vanishing gradient |
| Tanh | $(e^x-e^{-x})/(e^x+e^{-x})$ | (-1,1) | Hidden (old), RNN | Vanishing gradient |
| ReLU | max(0,x) | [0,∞) | Default hidden | Dead neurons |
| Leaky ReLU | max(0.01x,x) | (-∞,∞) | Fix dying ReLU | Slope tuning |
| PReLU | max(αx,x) | (-∞,∞) | Adaptive slope | Extra param |
| Softmax | $e^{z_i}/\Sigma e^{z_j}$ | (0,1) Σ=1 | Multi-class output | Saturated |
| ELU | x or α(eˣ−1) | (-α,∞) | Faster convergence | Compute cost |
| Swish | x·σ(x) | (-∞,∞) | EfficientNet | Compute cost |
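A NumPy sketch of the main activations from the table above; the Leaky ReLU slope (0.01) is the stated default, and the ELU $\alpha = 1.0$ is an assumed common default:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):                 # default negative slope 0.01
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):                         # alpha = 1.0 assumed here
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x):
    return x * sigmoid(x)                      # x * sigma(x)

def softmax(z):
    e = np.exp(z - z.max())                    # shift by max for numerical stability
    return e / e.sum()

x = np.array([-1.5, 0.0, 2.9])
print(relu(x))                                 # [0.  0.  2.9] -- matches the ReLU example
print(softmax(x))                              # sums to 1: a probability distribution
```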
A loss function measures how wrong the model's predictions are. The entire goal of training is to minimize the loss.
- MSE (Mean Squared Error): $\mathcal{L} = \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$
  - Penalizes large errors heavily (squared)
  - Sensitive to outliers
- Binary Cross-Entropy: $\mathcal{L} = -\frac{1}{N}\sum_i \left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$
  - Used with Sigmoid output
  - Punishes confident wrong predictions more
- Categorical Cross-Entropy: $\mathcal{L} = -\frac{1}{N}\sum_i \sum_c y_{i,c} \log \hat{y}_{i,c}$
  - $y_{i,c} = 1$ if sample $i$ belongs to class $c$, else 0 (one-hot encoding)
  - Used with Softmax output layer
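A NumPy sketch of these losses; the clipping constant `eps` is an assumption added to avoid `log(0)`:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true_onehot, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred), axis=1))

# A confident wrong prediction is punished far more than a hesitant one
print(binary_cross_entropy(np.array([1.0]), np.array([0.6])))   # ~0.51
print(binary_cross_entropy(np.array([1.0]), np.array([0.01])))  # ~4.61
```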
Backpropagation is the process of computing how much each weight contributed to the loss and updating them to reduce error. It works by propagating the error signal backward through the network, layer by layer, using the chain rule of calculus.
The gradient $\partial \mathcal{L}/\partial w$ tells you: "if I increase this weight slightly, does the loss go up or down, and by how much?" Each factor in the chain is simple to compute; chaining them gives the full gradient.
Data flows input → output. At each layer: linear transform then activation. Early layers: edges/textures. Mid layers: shapes. Final: full object recognition.
Error flows output → input. Gradient is computed at each layer using chain rule. Weights are adjusted in the direction that reduces loss. This is the "credit assignment" problem.
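A minimal sketch of forward and backward passes for one hidden layer, assuming NumPy, a ReLU hidden layer, a sigmoid output with binary cross-entropy (so the output-layer gradient simplifies to prediction minus target), and made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                                  # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)      # toy binary labels

W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(8)      # input -> hidden
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)      # hidden -> output
lr = 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(2000):
    # Forward pass: linear transform, then activation, at each layer
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)                   # ReLU hidden layer
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)                         # prediction

    # Backward pass: chain rule, from output back to input
    dz2 = (a2 - y) / len(X)                  # dL/dz2 for cross-entropy + sigmoid
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)            # propagate the error through ReLU
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # Gradient descent step: move opposite to the gradient
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```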
Gradient descent adjusts weights in the direction opposite to the gradient to minimize the loss. Think of it as rolling a ball down a hill — the ball is the parameters, the hill is the loss landscape, the goal is the valley (minimum loss).
$\eta$ = learning rate (most important hyperparameter). Too large → overshoots minimum. Too small → training crawls.
| Variant | Data per update | Pros | Cons |
|---|---|---|---|
| Batch GD | Entire dataset | Stable, accurate gradients | Very slow, memory intensive |
| Stochastic GD (SGD) | 1 sample | Fast updates, can escape local minima | Noisy updates, high variance |
| Mini-batch GD | 32–256 samples | Balance of speed and stability | Batch size is a hyperparameter |
Mini-batch GD is the standard in deep learning. Common batch sizes: 32, 64, 128.
Gradient descent finds $w, b$ that minimize $J(w,b)$.
Modern optimizers improve on plain gradient descent by adding momentum and adaptive learning rates to converge faster and more reliably.
| Optimizer | Key idea | Formula |
|---|---|---|
| SGD | Plain gradient descent | $W = W - \eta \nabla L$ |
| Momentum | Accumulates velocity in gradient direction | $v = \beta v - \eta \nabla L$, then $W = W + v$ |
| RMSprop | Adaptive LR per parameter — divide by running avg of squared gradients | $W = W - \frac{\eta}{\sqrt{v+\varepsilon}}\nabla L$ |
| Adam ★ | Momentum + RMSprop combined. Best general-purpose optimizer. | See below |
Adam update: $m = \beta_1 m + (1-\beta_1)\nabla L$, $\; v = \beta_2 v + (1-\beta_2)(\nabla L)^2$, $\; \hat{m} = \frac{m}{1-\beta_1^t}$, $\; \hat{v} = \frac{v}{1-\beta_2^t}$, $\; W = W - \eta\,\frac{\hat{m}}{\sqrt{\hat{v}}+\varepsilon}$. Defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\varepsilon=10^{-8}$. $\hat{m}, \hat{v}$ are bias-corrected estimates.
Adam combines momentum (uses past gradient directions) with adaptive learning rates (different LR per parameter based on gradient history). Works well out of the box for most problems.
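A NumPy sketch of one Adam step for a single parameter vector, using the defaults above; the toy loss $L = \|W\|^2$ and the learning rate are assumptions:

```python
import numpy as np

def adam_step(W, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m tracks the gradient direction (momentum),
    v tracks squared gradients (adaptive per-parameter learning rate)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W = np.array([1.0, -2.0])
m, v = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 101):
    grad = 2 * W                             # gradient of the toy loss L = ||W||^2
    W, m, v = adam_step(W, grad, m, v, t)
print(W)                                     # moves toward the minimum at [0, 0]
```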
Overfitting: Model memorizes training data and fails on new data. Signs: low training loss + high validation loss + large train/val accuracy gap.
| Technique | Mechanism | Effect |
|---|---|---|
| L2 (Weight Decay) | Adds $\lambda \|\mathbf{W}\|^2$ to loss | Penalizes large weights, drives toward small values |
| L1 | Adds $\lambda \|\mathbf{W}\|_1$ to loss | Promotes sparsity — many weights exactly zero |
| Dropout | Randomly zeros $p\%$ of neurons during training | Forces redundant representations, reduces co-adaptation |
| Batch Normalization | Normalizes layer inputs per mini-batch | Stabilizes training, acts as regularizer, allows higher LR |
| Early Stopping | Stop when val loss stops improving | Prevents over-training |
| Data Augmentation | Generate varied training samples (flip, crop, rotate) | Increases effective dataset size |
During training, each neuron is set to 0 with probability $p$. At test time, all neurons are active but outputs are scaled by $(1-p)$. Standard values: $p = 0.2\text{–}0.5$.
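A NumPy sketch of dropout exactly as described above (zero each neuron with probability $p$ during training, scale outputs by $1-p$ at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    """Training: each activation is zeroed independently with probability p."""
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test(a, p=0.5):
    """Test: all neurons active, outputs scaled by (1-p) to match the training-time average."""
    return a * (1 - p)

a = np.ones(10)
print(dropout_train(a, p=0.5))   # roughly half the activations are zeroed
print(dropout_test(a, p=0.5))    # all 0.5: same expected value as during training
```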
$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \quad y = \gamma \hat{x} + \beta$ · $\mu_B, \sigma_B^2$: batch mean and variance · $\gamma, \beta$: learnable scale and shift · Applied after the linear layer, before the activation
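A NumPy sketch of the batch-norm forward computation for one mini-batch; here $\gamma$ and $\beta$ are fixed for illustration, whereas in practice they are learned by backprop:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch, then rescale."""
    mu = x.mean(axis=0)                        # batch mean, per feature
    var = x.var(axis=0)                        # batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance
    return gamma * x_hat + beta                # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))            # mini-batch of 32, 4 features
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```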
Poor initialization → slow learning or divergence:
- Too small → vanishing gradients
- Too large → exploding gradients
| Hyperparameter | Typical values | Effect |
|---|---|---|
| Learning rate $\eta$ | $10^{-4}$ to $10^{-2}$ | Too high = diverge, too low = slow |
| Batch size | 32, 64, 128 | Larger = smoother but slower convergence |
| Hidden units | 64, 128, 256, 512 | More = capacity, more overfitting risk |
| Dropout rate | 0.2–0.5 | Higher = more regularization |
| Epochs | 10–200 | Use early stopping to control |
When implementing backpropagation, verify correctness by comparing the analytical gradient with a numerical approximation (centered difference): $\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\mathcal{L}(w+\epsilon) - \mathcal{L}(w-\epsilon)}{2\epsilon}$ for a small $\epsilon$.
Expected relative error between analytical and numerical gradient: < $10^{-7}$. If larger, suspect a bug in backprop.
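A sketch of such a gradient check on a single weight, assuming NumPy; the choice of $\epsilon = 10^{-5}$ and the toy one-weight "network" are assumptions:

```python
import numpy as np

def loss(w, x, y):
    """Toy loss: squared error of a one-weight network y_hat = sigmoid(w * x)."""
    y_hat = 1 / (1 + np.exp(-w * x))
    return (y_hat - y) ** 2

def analytical_grad(w, x, y):
    y_hat = 1 / (1 + np.exp(-w * x))
    return 2 * (y_hat - y) * y_hat * (1 - y_hat) * x       # chain rule by hand

def numerical_grad(w, x, y, eps=1e-5):
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

w, x, y = 0.7, 1.3, 1.0
g_a, g_n = analytical_grad(w, x, y), numerical_grad(w, x, y)
rel_error = abs(g_a - g_n) / max(abs(g_a), abs(g_n))
print(rel_error)      # should be well below 1e-7 if the analytical gradient is correct
```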
| Pattern | Diagnosis | Action |
|---|---|---|
| train_loss ↓, val_loss ↓ | Good training | Continue |
| train_loss ↓, val_loss ↑ | Overfitting | Add dropout, reduce capacity, early stop |
| Both losses high | Underfitting | Increase capacity, train longer |
| Large train–val gap | Overfitting | More regularization |
| Small train–val gap | Good generalization | Model is reliable |
| Metric | Use when | Example |
|---|---|---|
| Accuracy | Classes are balanced | MNIST digit recognition |
| Recall | False Negatives are costly | Disease detection — missing a sick patient is dangerous |
| Precision | False Positives are costly | Spam filter — sending real emails to spam is bad |
| F1-Score | Both FP and FN matter, or class imbalance | Fraud detection |
| AUC-ROC | Overall discrimination ability (threshold-independent) | Medical AI ranking |
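A NumPy sketch computing these metrics from predicted and true binary labels; the example arrays are made up:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives hurt precision
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives hurt recall
    tn = np.sum((y_pred == 0) & (y_true == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(classification_metrics(y_true, y_pred))    # (0.75, 0.75, 0.75, 0.75)
```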