MOD 01 Deep Learning Essentials & ANN
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Deep Learning Essentials
& Artificial Neural Networks

Complete course notes from the professor's lectures — DL fundamentals, ANN architecture, activation functions, backpropagation, and training. Based on the ENSAM 2025/2026 lecture PDFs.

Module: 01 of 07
Sources: DL Essentiels + ANN Lectures
Sections: A.1–A.5 · B.1–B.10
Part A
Deep Learning Essentials
A.1 What is Deep Learning?

Deep Learning is a class of machine learning algorithms that uses multiple layers of artificial neural networks to automatically extract features from data and learn complex patterns. Unlike traditional ML, it does not require manual feature engineering — it learns representations directly from raw data.

Three Core Properties
Compositionality (Hierarchical)

Models learn in layers. Each layer captures progressively more complex representations — edges → shapes → objects. The function is a cascade of non-linear transformations: $f(x) = g_n(g_{n-1}(\ldots g_1(x)))$

End-to-End Learning

The model takes raw input (pixels, audio, text) and directly produces output — no separate feature extraction step. Example: speech recognition → raw audio to text in one model.

Distributed Representations

Concepts are distributed across many neurons — no single neuron codes "cat." Groups of neurons work together, making representations robust and generalizable.

It's "Deep" When…

There is more than one stage of non-linear feature transformation. Two-layer models, SVMs, kernel methods, and decision trees are not deep — they have no feature hierarchy.

Before Training — Three Design Questions
  • Architecture $F(W, X)$: What is the structure of the network? (layers, units, connections)
  • Loss function $L(W, y_i, X_i)$: How do we measure error between prediction and truth?
  • Optimization method: How do we update weights to minimize the loss? (gradient descent)
Neural Network Learning Procedure
  1. Perform an inference on the training set (forward pass)
  2. Calculate the error between predictions and actual labels
  3. Determine the contribution of each neuron to the error using backpropagation
  4. Modify the weights to minimize the error using gradient descent

Gradient descent terminates when: (1) the error is sufficiently small, or (2) the max number of iterations is exceeded.
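A minimal sketch of one such iteration, assuming PyTorch and a toy regression setup (the model, data, and hyperparameters below are illustrative, not from the lecture):

import torch
import torch.nn as nn

# Illustrative setup: tiny 2-layer network, MSE loss, plain SGD
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 4), torch.randn(32, 1)   # dummy training batch

y_hat = model(x)             # 1. inference on the training set (forward pass)
loss = loss_fn(y_hat, y)     # 2. error between predictions and labels
optimizer.zero_grad()
loss.backward()              # 3. each weight's contribution to the error (backprop)
optimizer.step()             # 4. modify the weights to reduce the error (gradient descent)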

A.2 Machine Learning vs Deep Learning
Aspect | Machine Learning | Deep Learning
Feature engineering | Manual — human selects features | Automatic — learned from data
Data requirement | Works with small datasets | Requires large datasets
Compute | Low (CPU sufficient) | High (GPU required)
Interpretability | Higher — explainable | Lower — black box
Performance on complex tasks | Limited | State of the art
Unstructured data (images, audio) | Struggles | Excels
Traditional pipeline | Data → Feature extraction → Model | Raw data → End-to-end model

Traditional ML handles each step (preprocessing, feature extraction, classification) separately. Deep Learning integrates all steps into one end-to-end model trained jointly.

A.3 Types of Deep Architectures
Theoretician's Dilemma — Why Go Deep?

Shallow models (2-layer) can approximate any function — but deep models are more efficient. Deep architectures represent certain function classes (especially in vision) with far fewer parameters per layer. They can encode complex, hierarchical patterns that shallow models cannot without exponentially more neurons.

Three Architecture Families
1. Feed-Forward (FNN/CNN)

Data flows one direction: input → hidden → output, no cycles. Used for classification, regression. Examples: MLP, CNN. Best for tasks where the input is processed once to produce output.

2. Feed-Back (RNN/Autoencoder)

Outputs are fed back into the system to refine computations. Includes loops. Examples: RNN, Stacked Autoencoders. Used in generative models, sequence modeling.

3. Bi-Directional (DBM/Transformer)

Processes data in both forward and backward directions simultaneously. Examples: Deep Boltzmann Machines, BERT, Bidirectional RNNs. Full context available at each position.

Key DL Architecture Zoo

FNN — tabular data, classification
CNN — images, spatial data
RNN/LSTM/GRU — sequences, time-series
Transformer — NLP, vision
GAN — generative modeling
Encoder-Decoder — translation, segmentation

A.4 The Manifold Hypothesis

Manifold Hypothesis: Natural data lives in a low-dimensional non-linear manifold embedded in high-dimensional space. Variables in natural data are mutually dependent, so they do not actually explore the full space.

Example (faces): A face image is 1000×1000 = 1,000,000 pixels (dimensions). But a face has only ~3 Cartesian coordinates, ~3 Euler angles, and humans have <50 facial muscles. So the actual manifold of face images has <56 dimensions — deep learning is learning to navigate this manifold.

Practical Implication — Invariant Feature Learning

The goal is to embed the input non-linearly into a higher-dimensional space where previously non-separable patterns become separable, then pool together semantically similar regions. This is achieved by stacking:

Normalization → Filter bank → Non-linearity → Pooling (repeated for each deep stage)
A.5 Deep Learning — Advantages & Limitations
Advantages

• Automatic feature extraction — no manual engineering
• State-of-the-art on vision, speech, NLP
• Scales with data and compute
• End-to-end: one unified training signal
• Transfer learning: reuse learned representations
• Handles high-dimensional unstructured data

Limitations

• Requires large labeled datasets
• Computationally expensive — needs GPU
• Black box — poor interpretability
• Prone to overfitting on small datasets
• Sensitive to hyperparameters
• Can fail on distribution shift (train vs real-world)

Part B
Artificial Neural Networks (ANN)
B.1 ANN Architecture

An Artificial Neural Network (ANN) is inspired by the human brain. It consists of interconnected neurons organized in layers, transforming input data through learned weights to produce output predictions.

Three Layer Types
  • Input layer: Receives raw data — one neuron per feature (pixel, word index, sensor reading).
  • Hidden layers: Perform learned transformations. Called "hidden" because they are not directly exposed to input or output. The term "deep" refers to having many hidden layers.
  • Output layer: Produces the final prediction — one neuron per class (multi-class) or one neuron (regression/binary).
Weights — The "Volume Knob"

A weight $w$ determines how much a specific input signal matters. High weight = high impact. Example: deciding whether to go to a concert — if you hate rain, the "weather" input gets a high weight ($w=10$); if money is not an issue, "ticket price" gets a low weight ($w=1$).

Bias — The "Baseline Tendency"

Bias $b$ is an offset added to the neuron's output. It shifts the decision boundary — the neuron can fire even when all inputs are zero, or stay quiet even if inputs are high. Think of bias as your "mood" before seeing any data.

Neuron computation $$z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b = \mathbf{W}\cdot\mathbf{x} + b$$ $$a = f(z) \quad \leftarrow \text{activation function}$$
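A minimal NumPy sketch of this single-neuron computation (the sigmoid used for $f$ is only one possible choice; all values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.2, 0.5])    # inputs (e.g. weather, ticket price, distance)
w = np.array([10.0, 1.0, 2.0])   # weights — how much each input matters
b = -2.0                         # bias — baseline tendency to fire

z = np.dot(w, x) + b             # weighted sum plus bias
a = sigmoid(z)                   # activation f(z)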
Non-Linearity — Why It's Essential

Without activation functions, stacking layers collapses to a single linear equation — no matter how many layers you add. Non-linear activations allow the network to "fold, twist, warp" the input space, creating complex decision boundaries that no straight line could capture.

Linear: "The more I work, the more I earn" (straight line). Non-linear: "Water makes plants grow… until too much kills them" (curve). Real-world problems are curves.

B.2 Forward Pass — Computing Predictions

Data flows left to right through the network. At each layer, two operations occur:

  1. Linear transformation: Weighted sum of inputs plus bias
  2. Non-linear activation: Apply activation function to introduce non-linearity
Layer $l$ — forward computation $$\mathbf{Z}^{(l)} = \mathbf{W}^{(l)} \cdot \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$ $$\mathbf{a}^{(l)} = f\!\left(\mathbf{Z}^{(l)}\right)$$

$\mathbf{a}^{(l)}$: activations at layer $l$ · $\mathbf{W}^{(l)}$: weight matrix · $\mathbf{b}^{(l)}$: bias vector · $f$: activation function

Parameter Count

For a layer mapping $n_{in}$ inputs to $n_{out}$ outputs:

Parameters per layer $$\text{Parameters} = (n_{in} \times n_{out}) + n_{out}$$

Example: Network 784 → 256 → 128 → 10

Layer 1 (784→256): 784 × 256 + 256 = 200,960
Layer 2 (256→128): 256 × 128 + 128 = 32,896
Layer 3 (128→10): 128 × 10 + 10 = 1,290
Total: 235,146 parameters
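The same arithmetic in a short Python snippet, so the count can be checked for any architecture:

sizes = [784, 256, 128, 10]                     # layer widths from the example
total = 0
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params = n_in * n_out + n_out               # weights + biases
    total += params
    print(f"{n_in} -> {n_out}: {params:,} parameters")
print(f"Total: {total:,} parameters")           # 235,146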
B.3 Activation Functions

An activation function determines whether a neuron should "fire" by outputting a specific value based on its input. They introduce non-linearity — the key ingredient that makes deep networks powerful.

1. Sigmoid
Sigmoid (Logistic) $$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad \text{Range: } (0, 1)$$
  • Use: Binary classification output (probability interpretation)
  • Advantage: Smooth gradient, outputs interpretable as probabilities
  • Disadvantage (Vanishing gradient): For large $|x|$, derivative $\approx 0$ → early layers learn very slowly
  • Not zero-centered: Outputs always positive → optimization harder
2. Tanh (Hyperbolic Tangent)
Tanh $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \qquad \text{Range: } (-1, 1)$$
  • Use: Hidden layers in older architectures, RNNs
  • Advantage: Zero-centered → gradient descent easier than sigmoid
  • Disadvantage: Still suffers from vanishing gradients for large inputs
3. ReLU — Default for Hidden Layers
ReLU (Rectified Linear Unit) $$f(x) = \max(0, x) \qquad \text{Range: } [0, +\infty)$$
  • Use: Default for hidden layers in CNNs, MLPs
  • Advantage: Computationally cheap, no saturation for $x>0$, combats vanishing gradient
  • Sparse activation: Only active neurons fire → efficient representations
  • Disadvantage (Dying ReLU): If a neuron always receives negative input, it outputs 0 permanently and never learns

Example: $x = 2.9 \Rightarrow \text{ReLU}(2.9) = 2.9$. $x = -1.5 \Rightarrow \text{ReLU}(-1.5) = 0$

4. Leaky ReLU
Leaky ReLU $$f(x) = \max(0.01x,\; x)$$
  • Use: When standard ReLU leads to dead neurons
  • Advantage: Allows a small gradient for $x < 0$ → prevents dying neurons
  • Disadvantage: Requires tuning of the negative slope (default 0.01)
5. PReLU (Parametric ReLU)
PReLU $$f(x) = \max(\alpha x,\; x) \quad \alpha \text{ is learned during training}$$
  • Use: Advanced architectures requiring dynamic adaptation
  • Advantage: Slope $\alpha$ adapts via backprop → better performance
6. Softmax — Multi-class Output
Softmax $$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \qquad \text{Outputs sum to 1}$$
  • Use: Final layer for multi-class classification
  • Advantage: Produces probability distribution over classes
  • Disadvantage: Prone to saturated gradients when probabilities are near 0 or 1
7. ELU & 8. Swish
ELU (Exponential Linear Unit) $$f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$ Swish (Google) $$f(x) = x \cdot \sigma(x)$$
  • ELU: Faster convergence, reduces dead neurons, maintains positive mean activations
  • Swish: Used in EfficientNet. Smooth, non-monotonic. Sometimes outperforms ReLU in deep networks
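For reference, a minimal NumPy sketch of the activations above, each applied element-wise to an array z (the shift by max(z) in softmax is a standard numerical-stability trick, not from the lecture):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def swish(z):
    return z * sigmoid(z)

def softmax(z):
    e = np.exp(z - np.max(z))     # shift for numerical stability
    return e / e.sum()

# tanh is available directly as np.tanh(z)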
Comparison Table
Function | Formula | Range | Use Case | Key Issue
Sigmoid | $1/(1+e^{-x})$ | (0, 1) | Binary output | Vanishing gradient
Tanh | $(e^x-e^{-x})/(e^x+e^{-x})$ | (-1, 1) | Hidden (old), RNN | Vanishing gradient
ReLU | $\max(0, x)$ | [0, ∞) | Default hidden | Dead neurons
Leaky ReLU | $\max(0.01x, x)$ | (-∞, ∞) | Fix dying ReLU | Slope tuning
PReLU | $\max(\alpha x, x)$ | (-∞, ∞) | Adaptive slope | Extra parameter
Softmax | $e^{z_i}/\sum_j e^{z_j}$ | (0, 1), sums to 1 | Multi-class output | Saturated gradients
ELU | $x$ or $\alpha(e^x-1)$ | (-α, ∞) | Faster convergence | Compute cost
Swish | $x \cdot \sigma(x)$ | (-∞, ∞) | EfficientNet | Compute cost
B.4 Loss Functions

A loss function measures how wrong the model's predictions are. The entire goal of training is to minimize the loss.

Mean Squared Error (MSE) — Regression
MSE $$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
  • Penalizes large errors heavily (squared)
  • Sensitive to outliers
Binary Cross-Entropy — Binary Classification
Binary Cross-Entropy $$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right]$$
  • Used with Sigmoid output
  • Punishes confident wrong predictions more
Categorical Cross-Entropy — Multi-Class Classification
Categorical Cross-Entropy $$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
  • $y_{i,c} = 1$ if sample $i$ belongs to class $c$, else 0 (one-hot encoding)
  • Used with Softmax output layer
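A minimal NumPy sketch of the three losses (clipping the predictions away from 0 and 1 is an added numerical-stability detail):

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)              # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)              # y_onehot: one-hot labels
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))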
B.5 Backpropagation

Backpropagation is the process of computing how much each weight contributed to the loss and updating them to reduce error. It works by propagating the error signal backward through the network, layer by layer, using the chain rule of calculus.

Chain Rule — Gradient computation for weight $W^{(l)}$ $$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial W^{(l)}}$$

The gradient $\partial \mathcal{L}/\partial w$ tells you: "if I increase this weight slightly, does the loss go up or down, and by how much?" Each factor in the chain is simple to compute; chaining them gives the full gradient.

Forward vs Backward
Forward Pass (Inference)

Data flows input → output. At each layer: linear transform then activation. Early layers: edges/textures. Mid layers: shapes. Final: full object recognition.

Backward Pass (Learning)

Error flows output → input. Gradient is computed at each layer using chain rule. Weights are adjusted in the direction that reduces loss. This is the "credit assignment" problem.

Key insight: The gradient tells us in which direction to adjust each weight to reduce loss. Backprop makes this computation tractable for deep networks.
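A tiny worked sketch of the chain rule, assuming a single sigmoid neuron trained with a squared error on one example (all numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.0, 2.0]), 1.0     # one training example and its label
w, b = np.array([0.1, -0.2]), 0.0    # current parameters

# Forward pass
z = np.dot(w, x) + b
a = sigmoid(z)
loss = (a - y) ** 2

# Backward pass: dL/dw = (dL/da) * (da/dz) * (dz/dw)
dL_da = 2 * (a - y)
da_dz = a * (1 - a)                  # derivative of the sigmoid
dz_dw = x
grad_w = dL_da * da_dz * dz_dw       # gradient w.r.t. each weight
grad_b = dL_da * da_dz               # gradient w.r.t. the bias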
B.6 Gradient Descent

Gradient descent adjusts weights in the direction opposite to the gradient to minimize the loss. Think of it as rolling a ball down a hill — the ball is the parameters, the hill is the loss landscape, the goal is the valley (minimum loss).

Weight update rule $$W^{(l)}_{\text{new}} = W^{(l)}_{\text{old}} - \eta \cdot \frac{\partial \mathcal{L}}{\partial W^{(l)}}$$

$\eta$ = learning rate (most important hyperparameter). Too large → overshoots minimum. Too small → training crawls.

Three Variants
Variant | Data per update | Pros | Cons
Batch GD | Entire dataset | Stable, accurate gradients | Very slow, memory intensive
Stochastic GD (SGD) | 1 sample | Fast updates, can escape local minima | Noisy updates, high variance
Mini-batch GD | 32–256 samples | Balance of speed and stability | Batch size is a hyperparameter

Mini-batch GD is the standard in deep learning. Common batch sizes: 32, 64, 128.

Cost Function
Cost function (e.g., MSE for regression) $$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left(h(x_i) - y_i\right)^2$$

Gradient descent finds $w, b$ that minimize $J(w,b)$.
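A minimal sketch of gradient descent on this cost, assuming a linear hypothesis $h(x) = wx + b$ and a tiny 1-D dataset (data and learning rate are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])           # roughly y = 2x

w, b, eta = 0.0, 0.0, 0.01                   # parameters and learning rate
for step in range(1000):
    y_hat = w * x + b
    grad_w = np.mean(2 * (y_hat - y) * x)    # dJ/dw
    grad_b = np.mean(2 * (y_hat - y))        # dJ/db
    w -= eta * grad_w                        # step opposite to the gradient
    b -= eta * grad_b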

B.7 Optimizers

Modern optimizers improve on plain gradient descent by adding momentum and adaptive learning rates to converge faster and more reliably.

Optimizer | Key idea | Formula
SGD | Plain gradient descent | $W = W - \eta \nabla L$
Momentum | Accumulates velocity in the gradient direction | $v = \beta v - \eta \nabla L$; $W = W + v$
RMSprop | Adaptive LR per parameter — divide by a running average of squared gradients | $W = W - \frac{\eta}{\sqrt{v+\varepsilon}}\nabla L$
Adam ★ | Momentum + RMSprop combined. Best general-purpose optimizer. | See below
Adam — Adaptive Moment Estimation
Adam update equations $$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L \quad \leftarrow \text{1st moment (mean)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla L)^2 \quad \leftarrow \text{2nd moment (variance)}$$ $$W = W - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

Defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\varepsilon=10^{-8}$. $\hat{m}, \hat{v}$ are bias-corrected estimates.

Adam combines momentum (uses past gradient directions) with adaptive learning rates (different LR per parameter based on gradient history). Works well out of the box for most problems.
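A minimal NumPy sketch of one Adam update for a parameter array (names follow the equations above; t is the 1-based step index needed for bias correction):

import numpy as np

def adam_step(W, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # 1st moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2        # 2nd moment (variance)
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return W, m, v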

B.8 Regularization

Overfitting: Model memorizes training data and fails on new data. Signs: low training loss + high validation loss + large train/val accuracy gap.

Key Techniques
Technique | Mechanism | Effect
L2 (Weight Decay) | Adds $\lambda \|\mathbf{W}\|^2$ to the loss | Penalizes large weights, drives them toward small values
L1 | Adds $\lambda \|\mathbf{W}\|_1$ to the loss | Promotes sparsity — many weights exactly zero
Dropout | Randomly zeros a fraction $p$ of neurons during training | Forces redundant representations, reduces co-adaptation
Batch Normalization | Normalizes layer inputs per mini-batch | Stabilizes training, acts as a regularizer, allows higher LR
Early Stopping | Stop when val loss stops improving | Prevents over-training
Data Augmentation | Generate varied training samples (flip, crop, rotate) | Increases effective dataset size
Dropout

During training, each neuron is set to 0 with probability $p$. At test time, all neurons are active but outputs are scaled by $(1-p)$. Standard values: $p = 0.2\text{–}0.5$.
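A minimal NumPy sketch of this behaviour — masking during training, scaling by $(1-p)$ at test time (most modern frameworks use the equivalent "inverted dropout", which rescales during training instead):

import numpy as np

def dropout(a, p=0.5, training=True):
    if training:
        mask = np.random.rand(*a.shape) >= p   # keep each neuron with prob 1 - p
        return a * mask                        # dropped neurons output 0
    return a * (1 - p)                         # test time: scale activations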

Batch Normalization
Batch Norm equations $$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$$ $$y = \gamma \hat{x} + \beta$$

$\mu_B, \sigma_B^2$: batch mean and variance · $\gamma, \beta$: learnable scale and shift · Applied after linear layer, before activation
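A minimal NumPy sketch of the forward computation on a mini-batch (rows are samples, columns are features; $\gamma$ and $\beta$ would be learned by backprop):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # learnable scale and shift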

B.9 Training Pipeline & Hyperparameters
Standard Training Loop
1. Initialize weights (Xavier or He initialization)
2. For each epoch:
   a. Forward pass → compute predictions ŷ
   b. Compute loss L(y, ŷ)
   c. Backward pass → compute gradients ∂L/∂W (backprop)
   d. Update weights: W = W − η · ∂L/∂W (optimizer.step())
3. Evaluate on the validation set after each epoch
4. Apply early stopping if val_loss stops improving
5. Save a model checkpoint at the best val performance
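The same loop sketched in PyTorch, assuming a toy dataset and model (every name below is a placeholder; only the structure mirrors the steps above):

import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model, just to make the sketch self-contained
X, Y = torch.randn(512, 20), torch.randn(512, 1)
train_loader = DataLoader(TensorDataset(X[:400], Y[:400]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[400:], Y[400:]), batch_size=32)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(200):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)          # forward pass + loss
        loss.backward()                      # backward pass (backprop)
        optimizer.step()                     # weight update
    model.eval()                             # validation after each epoch
    with torch.no_grad():
        val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
    if val_loss < best_val:                  # checkpoint at best val loss
        best_val, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # early stopping
            break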
Weight Initialization

Poor initialization → slow learning or divergence:

Xavier (for Sigmoid/Tanh) $$W \sim \mathcal{N}\!\left(0,\; \sqrt{\frac{2}{n_{in}+n_{out}}}\right)$$ He (for ReLU) $$W \sim \mathcal{N}\!\left(0,\; \sqrt{\frac{2}{n_{in}}}\right)$$
  • Too small → vanishing gradients
  • Too large → exploding gradients
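A minimal NumPy sketch of both schemes, reading the second argument of $\mathcal{N}$ above as the standard deviation:

import numpy as np

def xavier_init(n_in, n_out):
    std = np.sqrt(2.0 / (n_in + n_out))      # suited to sigmoid/tanh layers
    return np.random.randn(n_out, n_in) * std

def he_init(n_in, n_out):
    std = np.sqrt(2.0 / n_in)                # suited to ReLU layers
    return np.random.randn(n_out, n_in) * std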
Key Hyperparameters
Hyperparameter | Typical values | Effect
Learning rate $\eta$ | $10^{-4}$ to $10^{-2}$ | Too high = diverges, too low = slow
Batch size | 32, 64, 128 | Larger = smoother but slower convergence
Hidden units | 64, 128, 256, 512 | More = more capacity, more overfitting risk
Dropout rate | 0.2–0.5 | Higher = more regularization
Epochs | 10–200 | Use early stopping to control
Gradient Checking (Debug Tool)

When implementing backpropagation, verify correctness by comparing the analytical gradient with a numerical approximation:

Numerical Gradient Approximation $$\frac{\partial L}{\partial w} \approx \frac{f(w + \varepsilon) - f(w - \varepsilon)}{2\varepsilon} \qquad \varepsilon \approx 10^{-5}$$

Expected relative error between analytical and numerical gradient: < $10^{-7}$. If larger, suspect a bug in backprop.
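A minimal sketch of the check for one scalar weight (f is any function computing the loss as a function of that weight; the lambda example at the end is purely illustrative):

def gradient_check(f, w, analytical_grad, eps=1e-5):
    numerical = (f(w + eps) - f(w - eps)) / (2 * eps)   # centered difference
    denom = max(abs(numerical), abs(analytical_grad), 1e-12)
    return abs(numerical - analytical_grad) / denom     # relative error, expect < 1e-7

# Example: f(w) = w**2 has analytical gradient 2w = 6 at w = 3
print(gradient_check(lambda w: w ** 2, 3.0, 6.0))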

Training Curves Interpretation
Pattern | Diagnosis | Action
train_loss ↓, val_loss ↓ | Good training | Continue
train_loss ↓, val_loss ↑ | Overfitting | Add dropout, reduce capacity, early stop
Both losses high | Underfitting | Increase capacity, train longer
Large train–val gap | Overfitting | More regularization
Small train–val gap | Good generalization | Model is reliable
B.10 Evaluation Metrics
Confusion Matrix (Binary)
                | Predicted Positive | Predicted Negative
Actual Positive | TP (True Pos)      | FN (False Neg)
Actual Negative | FP (False Pos)     | TN (True Neg)
Core metrics $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$ $$\text{Precision} = \frac{TP}{TP + FP} \quad \leftarrow \text{of all predicted positives, how many are real?}$$ $$\text{Recall} = \frac{TP}{TP + FN} \quad \leftarrow \text{of all actual positives, how many did we catch?}$$ $$F_1 = \frac{2 \times P \times R}{P + R} \quad \leftarrow \text{harmonic mean of Precision and Recall}$$
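A short sketch computing the four metrics from raw confusion-matrix counts (the counts are illustrative):

tp, fp, fn, tn = 80, 10, 20, 90      # illustrative counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)           # of predicted positives, how many are real
recall    = tp / (tp + fn)           # of actual positives, how many were caught
f1        = 2 * precision * recall / (precision + recall)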
When to Use Which Metric
Metric | Use when | Example
Accuracy | Classes are balanced | MNIST digit recognition
Recall | False Negatives are costly | Disease detection — missing a sick patient is dangerous
Precision | False Positives are costly | Spam filter — sending real emails to spam is bad
F1-Score | Both FP and FN matter, or class imbalance | Fraud detection
AUC-ROC | Overall discrimination ability (threshold-independent) | Medical AI ranking