MOD 03 RNN / LSTM / GRU — Sequential Models
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Recurrent Neural Networks
LSTM & GRU

Complete course notes — sequential data, RNN architecture, BPTT, LSTM gates (forget/input/output), GRU, bidirectional RNN, and when to use each model. Based on ENSAM 2025/2026 lecture PDFs.

Module 03 of 07
Sources: RNN/LSTM Lecture + ENSAM RNN PDF
Labs: Temperature Forecasting · Poetry Generation
01 Sequential Data — Order Matters

Sequential data is data where the order of elements carries meaning. The same data point means different things in different contexts.

Text / NLP

"The bank near the river" — same word "bank" has different meaning depending on context. Past words determine current word's meaning.

Time-Series

Stock price at $t=2$ depends on prices at $t=0$ and $t=1$ — order encodes causality. Prediction impossible without temporal context.

Audio & Speech

Individual sound samples only make sense in sequence. "Not good" → negative. "Not bad" → positive. Position determines meaning.

Other Examples

EEG signals, DNA sequences, video frames, weather patterns, robot joint angles, music notes — all require memory of the past.

Key insight: Sequential data has MEMORY — the past shapes the present. Context is everything. Order-independent models destroy all this meaning.
02 Why Feed-Forward Networks Fail on Sequences

Standard FFNs (MLP, ANN) have three critical limitations that make them unsuitable for sequential data:

  • No temporal memory: The FFN sees isolated tokens with zero awareness of what came before. Classifying "The movie was not good" — it cannot know "not" negates "good".
  • Fixed-length constraint: Real sequences vary in length. Padding wastes compute; truncation destroys information irreversibly.
  • No weight sharing: Learning a pattern (e.g., "not good") at position 5 does not transfer to position 12 — leads to massive parameter counts and poor generalization.
Concrete Failure — EEG Seizure Prediction
Task: Predict a seizure based on 1,000 EEG time steps.
FFN problems:
✗ Needs 1,000 separate input neurons
✗ If data = 999 steps → model breaks
✗ Pattern at step 50 invisible at step 800
RNN solution:
✓ ONE cell scans all steps sequentially
✓ Pattern learned once, applied everywhere
✓ Memory maintained across all time steps
03 Recurrent Neural Network Architecture

A Recurrent Neural Network (RNN) adds a loop to a standard network — maintaining a hidden state across time steps. The same weights are shared across all time steps.

RNN hidden state update (at each time step $t$) $$h_t = f(W_h \cdot h_{t-1} + U \cdot x_t + b)$$ $$y_t = g(V \cdot h_t + c)$$

$h_t$: hidden state at step $t$ · $x_t$: input at step $t$ · $h_{t-1}$: previous hidden state
$W_h, U$: weight matrices (shared across all steps) · $b$: bias · $f$: tanh or ReLU
$y_t$: output at step $t$ · $V$: output weight matrix · $g$: output activation (e.g., Softmax)

The hidden state $h_t$ is often described as "everything I've seen so far, compressed into a vector." The same weights $W_h$ and $U$ are reused at EVERY time step — this is weight sharing across time.
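To make the recurrence concrete, here is a minimal NumPy sketch of these two equations unrolled over a short sequence. The dimensions, the tanh activation, and the random toy weights are illustrative assumptions, not values from the course.

```python
import numpy as np

def rnn_forward(x_seq, W_h, U, b, V, c):
    """Unroll a vanilla RNN: h_t = tanh(W_h·h_{t-1} + U·x_t + b), y_t = V·h_t + c."""
    h = np.zeros(W_h.shape[0])            # h_0: empty memory
    outputs = []
    for x_t in x_seq:                     # the SAME W_h, U, V are reused at every step
        h = np.tanh(W_h @ h + U @ x_t + b)
        outputs.append(V @ h + c)         # raw logits; apply softmax for classification
    return np.array(outputs), h

# Toy example: 5 time steps, 3-dim inputs, 4-dim hidden state, 2 output classes
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
W_h = rng.normal(size=(4, 4)) * 0.1
U   = rng.normal(size=(4, 3)) * 0.1
V   = rng.normal(size=(2, 4)) * 0.1
ys, h_last = rnn_forward(x_seq, W_h, U, np.zeros(4), V, np.zeros(2))
print(ys.shape, h_last.shape)             # (5, 2) (4,)
```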

04 Backpropagation Through Time (BPTT)

To train RNNs, we unroll the network across time steps — treating it as a very deep network where each layer is one time step. Then standard backpropagation is applied through the unrolled chain.

Total loss (sum over all time steps) $$\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t(y_t, \hat{y}_t)$$

BPTT suffers from the vanishing/exploding gradient problem: as the error flows backward, it is repeatedly multiplied by factors involving $W_h$ (and the activation derivative), once per time step. If $\|W_h\| < 1$ → gradients shrink exponentially (vanish). If $\|W_h\| > 1$ → gradients explode.

Solution for exploding: Gradient clipping (cap gradients at a max norm).
Solution for vanishing: Use LSTM or GRU with gating mechanisms.
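A quick NumPy illustration of why this happens: repeatedly multiplying a gradient by a matrix whose norm is slightly below or slightly above 1 makes it collapse or blow up. The scales 0.9 and 1.1 and the 50-step horizon are arbitrary choices for this demo (the activation derivative is ignored for simplicity), and the clipping comment at the end is just one common way to apply gradient clipping in Keras.

```python
import numpy as np

T = 50                                    # time steps to backpropagate through
for scale in (0.9, 1.1):                  # ||W_h|| slightly below vs. slightly above 1
    W_h = np.eye(8) * scale
    g = np.ones(8)                        # stand-in for dL/dh_T at the last step
    for _ in range(T):                    # BPTT: one multiplication by W_h^T per step back
        g = W_h.T @ g
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(g):.2e}")

# scale=0.9 -> the norm collapses toward zero (vanishing gradient)
# scale=1.1 -> the norm grows by orders of magnitude (exploding gradient)
# A common fix for the exploding case is to cap the gradient norm, e.g. in Keras:
#   optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
```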

05 Types of RNN Architectures (Input/Output Shapes)

Not all problems have the same input/output shape. RNNs are flexible:

Type | Input → Output | Use Case
Many-to-One | Sequence → Single value | Sentiment analysis (sentence → positive/negative)
One-to-Many | Single value → Sequence | Image captioning (image → sentence)
Many-to-Many (sync) | Sequence → Sequence (same length) | POS tagging (word → tag for each word)
Many-to-Many (async) | Sequence → Sequence (different length) | Machine translation (Seq2Seq with Encoder-Decoder)
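As a concrete sketch, in Keras the many-to-one vs. synchronized many-to-many choice mostly comes down to `return_sequences`; the vocabulary size, layer widths, and tag count below are illustrative assumptions.

```python
import tensorflow as tf

# Many-to-one: sentiment analysis — read the whole sequence, emit one label
many_to_one = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=64),
    tf.keras.layers.SimpleRNN(32),                          # return_sequences=False: keep only h_T
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Many-to-many (sync): POS tagging — one tag per input token
many_to_many = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=64),
    tf.keras.layers.SimpleRNN(32, return_sequences=True),   # emit h_t at every step
    tf.keras.layers.Dense(17, activation="softmax"),        # e.g. 17 POS tags
])

print(many_to_one(tf.zeros((2, 20), dtype=tf.int32)).shape)   # (2, 1)
print(many_to_many(tf.zeros((2, 20), dtype=tf.int32)).shape)  # (2, 20, 17)
```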
06 Vanilla RNN Limitations
  • Short-term memory: Vanilla RNN struggles with long-term dependencies — information from far-back steps effectively "fades away" due to vanishing gradients.
  • Vanishing gradient: Gradients diminish exponentially over many time steps → early steps contribute almost nothing to the learning signal.
  • Sequential bottleneck: Every time step must wait for the previous one — cannot be parallelized on GPU. Transformers solve this by processing all positions simultaneously.
  • No built-in attention: RNNs treat all positions similarly — there is no mechanism to relate position 1 to position 47 directly.
LSTM
Long Short-Term Memory Networks
07 LSTM — Overview & Motivation

LSTM (Hochreiter & Schmidhuber, 1997) is a specialized RNN architecture designed to learn and remember information over long sequences, effectively solving the vanishing gradient problem.

Key innovation: a separate cell state $C_t$ acts as a "protected highway" for information — gradient flows through the $+$ operation nearly unchanged. This is why LSTM works for long sequences.

Analogy — Person Reading a Book

Think of the LSTM cell like a person reading a book and taking notes:

  • Forget gate: "Strike through old notes that are no longer relevant" (e.g., the previous paragraph's subject)
  • Input gate: "Write new information that seems important" (e.g., note the new subject "she")
  • Output gate: "Decide what part of your notes to say out loud right now"
08 LSTM Gates — Full Equations
1. Forget Gate — "What to Discard"
Forget gate $$f_t = \sigma(W_f \cdot [h_{t-1},\; x_t] + b_f)$$

Output: value between 0 and 1 for each cell state dimension.
0 = "completely forget this" · 1 = "keep everything"
Example: when the model sees a new paragraph, it can forget the previous subject's gender so agreement follows the new subject.

2. Input Gate — "What New Information to Store"
Input gate (two parts) $$i_t = \sigma(W_i \cdot [h_{t-1},\; x_t] + b_i) \quad \leftarrow \text{what to update (gate)}$$ $$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1},\; x_t] + b_C) \quad \leftarrow \text{candidate values to write}$$

Together: $i_t$ decides what to update, $\tilde{C}_t$ provides the candidate values. Example: noting new subject "she" → future verbs should use feminine form.

3. Cell State Update — "The Core Memory"
Cell state update $$C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$$

Forget some old stuff ($f_t \times C_{t-1}$) + add some new stuff ($i_t \times \tilde{C}_t$).
The gradient flows through the $+$ operation nearly unchanged — this is why LSTM solves vanishing gradient!

4. Output Gate — "What to Output"
Output gate & hidden state $$o_t = \sigma(W_o \cdot [h_{t-1},\; x_t] + b_o) \quad \leftarrow \text{output gate}$$ $$h_t = o_t \times \tanh(C_t) \quad \leftarrow \text{hidden state output}$$

$o_t$ decides what part of cell state $C_t$ to output as $h_t$. Example: cell state knows the full subject, but only outputs what's needed for the current prediction.

Summary — Four Equations
Given: [h_{t-1}, x_t] as combined input
1. Forget gate:  f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
2. Input gate:   i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
                 C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
3. Cell update:  C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
4. Output gate:  o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
                 h_t = o_t ⊙ tanh(C_t)
Output: h_t (hidden state) and C_t (cell state) → passed to next time step
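These four equations translate directly into one NumPy step; the concatenation $[h_{t-1}, x_t]$ is built explicitly, and the toy dimensions and random weights are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step; each W_* maps the concatenated [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate: what to discard from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)            # input gate: what to write
    C_tilde = np.tanh(W_C @ z + b_C)        # candidate values
    C_t = f_t * C_prev + i_t * C_tilde      # cell state: forget some old, add some new
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(C_t)                # hidden state passed to the next step
    return h_t, C_t

# Toy dimensions: 3-dim input, 4-dim hidden/cell state
rng = np.random.default_rng(0)
d_in, d_h = 3, 4

def rand_W():
    return rng.normal(size=(d_h, d_h + d_in)) * 0.1

params = (rand_W(), rand_W(), rand_W(), rand_W(),
          np.zeros(d_h), np.zeros(d_h), np.zeros(d_h), np.zeros(d_h))
h, C = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(6, d_in)):      # scan a 6-step sequence
    h, C = lstm_step(x_t, h, C, *params)
print(h.shape, C.shape)                     # (4,) (4,)
```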
09 LSTM Variants
Bidirectional LSTM

Processes input in both forward and backward directions. Concatenates both hidden states: $[h_\text{forward};\, h_\text{backward}]$. Used when full context is available (document NLP).

Stacked LSTM

Multiple LSTM layers stacked — output of one LSTM layer feeds into the next. Increases learning capacity. Used for complex tasks like speech recognition.
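A minimal Keras sketch of a stacked LSTM: every layer except the last must return the full sequence so the next layer receives one input per time step. The input shape and layer sizes are placeholder assumptions.

```python
import tensorflow as tf

stacked = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 40)),                    # e.g. 100 frames of 40 audio features
    tf.keras.layers.LSTM(128, return_sequences=True),   # layer 1 passes a full sequence to layer 2
    tf.keras.layers.LSTM(64),                           # layer 2 keeps only the final h_T
    tf.keras.layers.Dense(10, activation="softmax"),
])
stacked.summary()
```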

ConvLSTM

Integrates convolution into LSTM for spatial-temporal data. Used for video analysis and weather forecasting where both spatial and temporal dependencies matter.

LSTM Challenges

• Complex architecture — computationally intensive
• ≈4× the parameters of a vanilla RNN
• Slow to train on long sequences
• Still sequential — cannot parallelize

GRU
Gated Recurrent Unit
10 GRU — Streamlined LSTM

GRU (Cho et al., 2014) is a simplified LSTM with fewer parameters (~25% fewer). It merges the forget and input gates into a single update gate and eliminates the separate cell state.

GRU equations $$z_t = \sigma(W_z \cdot [h_{t-1},\; x_t] + b_z) \quad \leftarrow \text{update gate (forget + input combined)}$$ $$r_t = \sigma(W_r \cdot [h_{t-1},\; x_t] + b_r) \quad \leftarrow \text{reset gate}$$ $$\tilde{h}_t = \tanh(W \cdot [r_t \times h_{t-1},\; x_t] + b) \quad \leftarrow \text{candidate hidden state}$$ $$h_t = z_t \times h_{t-1} + (1 - z_t) \times \tilde{h}_t \quad \leftarrow \text{hidden state update}$$
Update Gate $z_t$

Decides how much past state to keep vs. how much of the new candidate to use. Combines forget + input gate from LSTM into one operation.

Reset Gate $r_t$

Decides how much past state influences the new candidate. When $r_t \approx 0$, the new candidate ignores previous state — fresh start.

The hidden state $h_t$ directly serves as both output and memory — no separate cell state. GRU trains faster than LSTM with similar performance on medium-length sequences.
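The same equations as a single NumPy step, following the update convention used above (where $z_t$ multiplies the old state); the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W, b_z, b_r, b):
    """One GRU time step; each weight matrix maps a concatenated [·, x_t] input."""
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ zx + b_z)                   # update gate: keep old vs. take new
    r_t = sigmoid(W_r @ zx + b_r)                   # reset gate: how much past enters the candidate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]) + b)
    return z_t * h_prev + (1.0 - z_t) * h_tilde     # h_t serves as both output and memory

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W_z, W_r, W = (rng.normal(size=(d_h, d_h + d_in)) * 0.1 for _ in range(3))
h = np.zeros(d_h)
for x_t in rng.normal(size=(6, d_in)):              # scan a 6-step sequence
    h = gru_step(x_t, h, W_z, W_r, W, np.zeros(d_h), np.zeros(d_h), np.zeros(d_h))
print(h.shape)                                      # (4,)
```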

11 Bidirectional RNN (BiRNN / BiLSTM)

Bidirectional RNN runs two RNNs in parallel: one forward (left-to-right) and one backward (right-to-left) through the sequence. At each step, concatenates both hidden states: $[h^\rightarrow; h^\leftarrow]$.

Why Bidirectional?

"The bank was by the river bank" — the second "bank" needs context from BOTH directions to be correctly understood as "riverbank." A forward-only RNN only sees what came before.

Use Cases

Named entity recognition, POS tagging, machine translation encoder, question answering — any task where full context is available at inference time.

Limitation

Cannot be used for real-time streaming or autoregressive generation — you don't have future tokens yet. Also requires 2× compute.
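A Keras sketch of a BiLSTM tagger: the `Bidirectional` wrapper runs the forward and backward LSTMs and concatenates their hidden states at each step. The vocabulary size and the 9-tag output are placeholder assumptions.

```python
import tensorflow as tf

bilstm_tagger = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20_000, output_dim=64),
    tf.keras.layers.Bidirectional(                           # concatenates forward and backward h_t
        tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(9, activation="softmax"),          # e.g. 9 NER tags, one per token
])
print(bilstm_tagger(tf.zeros((1, 30), dtype=tf.int32)).shape)  # (1, 30, 9)
```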

12 Comparison Table — RNN vs LSTM vs GRU
Feature | Vanilla RNN | LSTM | GRU
Memory type | Hidden state only | Hidden state + Cell state | Hidden state only
Gates | None | 3 (forget, input, output) | 2 (update, reset)
Parameters | Fewest | Most (4× RNN) | Middle (3× RNN)
Long-term memory | Very poor | Excellent | Good
Training speed | Fastest | Slowest | Middle
Vanishing gradient | Severe | Nearly solved | Nearly solved
Best for | Very short sequences (<10 steps) | Long, complex sequences (50+ steps) | Medium sequences, limited compute
Introduced | ~1986 | 1997 | 2014
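The parameter ratios in the table can be sanity-checked directly in Keras; `reset_after=False` makes the GRU match the textbook 3× formulation (Keras's default adds a few extra biases), and the layer sizes here are arbitrary.

```python
import tensorflow as tf

d_in, units = 32, 64
inp = tf.keras.Input(shape=(None, d_in))
for name, layer in [
    ("SimpleRNN", tf.keras.layers.SimpleRNN(units)),
    ("LSTM     ", tf.keras.layers.LSTM(units)),
    ("GRU      ", tf.keras.layers.GRU(units, reset_after=False)),
]:
    model = tf.keras.Model(inp, layer(inp))
    print(name, model.count_params())

# SimpleRNN: units*(units + d_in + 1)     =  6,208
# LSTM:      4 * units*(units + d_in + 1) = 24,832  (4× the RNN)
# GRU:       3 * units*(units + d_in + 1) = 18,624  (3× the RNN)
```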
13 When to Use Which? — Decision Guide
Use | When | Example
Vanilla RNN | Very short sequences (<10 steps), maximum speed, as a learning exercise | Simple character-level generation with 5-char windows
LSTM | Long sequences (50+ steps), precise timing matters, best accuracy needed, complex task | Translating 100-word paragraphs, stock price 30 days ahead
GRU | Medium-length sequences, limited compute (mobile/edge), similar to LSTM but faster | Sentiment analysis on tweets, keyword spotting on a microcontroller
BiLSTM | Full context always available (no streaming), both directions needed | Named entity recognition in a document, question answering
Seq2Seq | Input and output lengths differ | Machine translation, summarization, speech-to-text
Where RNNs Still Win Over Transformers
  • Real-time streaming: Process one token at a time with $O(1)$ memory (see the sketch after this list). A Transformer must keep all past tokens around to attend to them, so its memory grows with the stream — infeasible for infinite streams.
  • Edge/embedded devices: A tiny GRU runs on a microcontroller for wake-word detection. GPT-style transformers cannot.
  • Irregular time-series: Sensor data, ECG, financial ticks — RNNs handle varying time steps naturally.
  • Reinforcement learning: Agents in partially observable environments use LSTMs as working memory.
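As a sketch of the streaming point above: only the current hidden state is kept between steps, so memory stays constant no matter how long the stream runs. The recurrent update below is a minimal stand-in for any trained RNN/GRU cell, with made-up weights and an assumed wake-word-style output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W_h = rng.normal(size=(d_h, d_h)) * 0.1
U   = rng.normal(size=(d_h, d_in)) * 0.1
w_out = rng.normal(size=d_h) * 0.1

h = np.zeros(d_h)                            # the ONLY state carried across the stream
for step in range(10_000):                   # could run forever: memory does not grow with t
    x_t = rng.normal(size=d_in)              # latest sensor reading / audio frame
    h = np.tanh(W_h @ h + U @ x_t)           # update memory in place
    score = 1 / (1 + np.exp(-(w_out @ h)))   # e.g. wake-word probability right now

print(f"processed 10,000 steps, state size still {h.nbytes} bytes")
```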
Why Transformers Won (for most NLP)
  • Parallelizable training: All positions processed simultaneously — 10–100× faster than sequential RNNs on modern GPUs.
  • Global attention: Each position can directly attend to any other position — no sequential bottleneck.
  • Scales with data: Pre-training on internet-scale datasets is only feasible with parallel processing.
Key Mental Model Summary
Concept | One-line intuition
RNN | A network with memory that updates at each step
Hidden state $h_t$ | "Everything I've seen so far, compressed into a vector"
Vanishing gradient | Gradients shrink exponentially backwards through time
LSTM cell state $C_t$ | A protected highway for gradients — only touched by + and ×
Forget gate | "What old memory is no longer relevant?"
Input gate | "What new information should I write to memory?"
Output gate | "What should I actually output right now?"
GRU | LSTM with forget+input merged, cell state eliminated
Bidirectional | Run two RNNs: one left-to-right, one right-to-left
Why Transformers won | Parallelizable training + global attention, no sequential bottleneck
Why RNNs persist | $O(1)$ inference memory, streaming, tiny devices