Introduction to Natural Language Processing
Complete course notes — language layers, NLP pipeline (cleaning, tokenization, stemming), structural analysis (POS, NER, parsing), text representation (BoW, TF-IDF, Word2Vec, BERT). Based on ENSAM 2025/2026 lecture PDFs.
Key insight: Natural language is ambiguous at every level. "bank" = financial institution OR riverbank. "I saw the man with the telescope" has two valid parse trees. This is what makes NLP hard.
| Layer | Studies | Example | NLP Connection |
|---|---|---|---|
| Phonology | Sounds & patterns | The /k/ in "cat" vs "chord" | Speech-to-text, audio models |
| Morphology | Word structure, meaning units | un-break-able → prefix + root + suffix | Tokenization, stemming, lemmatization |
| Syntax | Grammar — how words form sentences | "Dog bites man" ≠ "Man bites dog" | Parsing, POS tagging, grammar check |
| Semantics | Literal meaning of words/sentences | "Colourless green ideas" = syntactically valid but meaningless | Word embeddings, semantic search |
| Pragmatics | Meaning shaped by context/intent | "Can you pass the salt?" = request, not a question | Intent detection, dialogue systems |
These layers are hierarchical and interdependent. Early NLP failed by focusing only on lower layers (phonology, syntax) and ignoring higher ones (pragmatics).
Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human languages — bridging the gap between human communication and computer comprehension. It combines:
Linguistics: how language works — grammar, syntax, semantics, pragmatics. The scientific study of language structure and meaning.
Computer science: how to process it — algorithms, data structures, parsing, search. The engineering side of handling language computationally.
Machine learning: how to learn from data — pattern recognition, neural networks, embeddings. Learning language representations from massive corpora.
1950s–60s: Rule-based
1980s–90s: Statistical
2000s: Classic ML
2010s: Deep Learning
2017–now: Transformers & LLMs
2.5 quintillion bytes of data produced daily — the vast majority is natural language text. Without NLP, this enormous body of knowledge is invisible to machines.
| Application | Example | NLP Task |
|---|---|---|
| Search Engines | Google parsing "best pizza near me tonight" | Intent detection + location + time extraction |
| Virtual Assistants | Siri, Alexa, Google Assistant | Speech recognition + intent + response generation |
| Machine Translation | Google Translate, DeepL | Seq2Seq modeling, neural translation |
| Spam Filtering | Gmail, Outlook | Text classification |
| Sentiment Analysis | Brand monitoring on social media | Opinion mining, classification |
| Medical NLP | Extracting diagnoses from clinical notes | NER, relation extraction |
| Code Generation | GitHub Copilot, Claude, ChatGPT | Language modeling, code synthesis |
| Autocomplete | Keyboard prediction | Language modeling (next word prediction) |
Natural Language Understanding (NLU): the machine reads/hears text and extracts structured meaning.
Input: messy human language → Output: structured data, labels, decisions.
Tasks: Sentiment Classification, Named Entity Recognition, Intent Detection, Question Answering, Fact Extraction
Natural Language Generation (NLG): the machine produces coherent, fluent text from structured data.
Input: data or prompts → Output: human-readable language.
Tasks: Summarization, Machine Translation, Dialogue Response, Report Generation, Code Generation
Remove noise before any processing. Each step has a specific purpose — order matters:
| # | Step | Code | Why |
|---|---|---|---|
| 1 | Lowercase | text.lower() | Reduces vocabulary ("Park" = "park") |
| 2 | Remove HTML tags | re.sub(r'<[^>]+>', ' ', text) | Web-scraped data contains markup |
| 3 | Remove URLs | re.sub(r'http\S+|www\S+', ' ', text) | URLs are noise for most NLP tasks |
| 4 | Remove punctuation | re.sub(r'[^\w\s]', ' ', text) | Punctuation rarely carries meaning |
| 5 | Remove numbers (optional) | re.sub(r'\d+', ' ', text) | Task-dependent — keep for financial data |
| 6 | Normalize whitespace | re.sub(r'\s+', ' ', text).strip() | Steps 2–5 leave extra spaces |
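A rough sketch of the six steps above as one Python function; the function name, the remove_numbers flag, and the sample string are illustrative choices, not prescribed by the course:

```python
import re

def clean_text(text: str, remove_numbers: bool = False) -> str:
    text = text.lower()                           # 1. lowercase
    text = re.sub(r'<[^>]+>', ' ', text)          # 2. strip HTML tags
    text = re.sub(r'http\S+|www\S+', ' ', text)   # 3. strip URLs
    text = re.sub(r'[^\w\s]', ' ', text)          # 4. strip punctuation
    if remove_numbers:                            # 5. optional, task-dependent
        text = re.sub(r'\d+', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()      # 6. normalize whitespace

print(clean_text("<p>Visit https://example.com, it's the BEST!!</p>"))
# -> "visit it s the best"
```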
Split text into individual units (tokens). Modern LLMs use subword tokenization (BPE — Byte Pair Encoding) to handle rare and unknown words.
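As a rough illustration of how BPE builds a subword vocabulary, the sketch below repeatedly merges the most frequent adjacent pair of symbols; the toy corpus and the number of merge steps are made up for demonstration:

```python
from collections import Counter

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}   # word -> frequency
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

def most_frequent_pair(vocab):
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

for step in range(6):                             # 6 merge operations
    a, b = most_frequent_pair(vocab)
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    vocab = merged
    print(f"merge {step + 1}: {a} + {b} -> {a + b}")
# Rare or unseen words end up split into learned pieces, so nothing is "unknown".
```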
Remove high-frequency, low-information words: "the", "is", "a", "in", "are" — focus on content-bearing tokens.
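A tiny filtering sketch; the stop-word set below is a small illustrative subset (libraries such as NLTK and spaCy ship full stop-word lists):

```python
# Keep only content-bearing tokens by dropping known stop words.
stop_words = {"the", "is", "a", "in", "are", "of", "to"}

tokens = ["the", "cat", "is", "in", "the", "garden"]
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)   # ['cat', 'garden']
```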
Reduce words to their base form so "run", "runs", "running", "ran" are treated as the same concept.
| Word | Stemming | Lemmatization |
|---|---|---|
| running | run | run |
| studies | studi (crude!) | study (correct) |
| better | better | good (uses POS) |
Stemming: Fast, crude — chops off suffixes (may produce non-words like "studi").
Lemmatization: Slower, accurate — uses vocabulary and POS context to find the proper base form.
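One way to reproduce the table above is with NLTK's PorterStemmer and WordNetLemmatizer (assumes nltk is installed and the wordnet resource has been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "better"]:
    print(word, "->", stemmer.stem(word))          # run, studi, better

print(lemmatizer.lemmatize("studies"))             # study
print(lemmatizer.lemmatize("better", pos="a"))     # good (needs the adjective POS tag)
print(lemmatizer.lemmatize("running", pos="v"))    # run
```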
Assign a grammatical role to each token. The same word can be different parts of speech depending on usage.
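A possible illustration with NLTK's tagger (requires the punkt and averaged_perceptron_tagger resources); the sentence is chosen so that "book" appears once as a verb and once as a noun:

```python
import nltk

tokens = nltk.word_tokenize("Will you book a flight so I can read my book?")
print(nltk.pos_tag(tokens))
# "book" should be tagged as a verb (VB) the first time and a noun (NN) the second
```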
Identify and classify specific real-world objects with proper names. Critical for knowledge graphs and information extraction.
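A sketch of named entity recognition using spaCy as one common tool (assumes the en_core_web_sm model has been downloaded); the example sentence and the labels in the comment are illustrative, not guaranteed output:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired a startup in Paris for $2 billion in 2024.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected: Apple -> ORG, Paris -> GPE, $2 billion -> MONEY, 2024 -> DATE
```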
Analyze grammatical relationships between words — which word is the subject, object, modifier of which other word. Answers: who did what to whom?
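The same spaCy model also produces a dependency parse; each token points to its grammatical head, which is what answers "who did what to whom":

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog bit the postman")
for token in doc:
    print(f"{token.text:<8} {token.dep_:<8} head: {token.head.text}")
# 'dog' should come out as nsubj (subject) of 'bit' and 'postman' as dobj (object)
```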
A Context-Free Grammar (CFG) is a formal system that defines a language as a set of production rules. CFGs are the foundation of early NLP parsers — and they reveal WHY natural language is so hard to parse.
CFGs cannot resolve ambiguities like the telescope sentence from the introduction: both parses obey the grammar, and nothing in the rules says which reading is intended. Natural language regularly violates CFG assumptions, which drove the move to probabilistic and neural approaches.
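A hand-written toy grammar makes the point concrete (the rules below are illustrative, not from the lecture); NLTK's chart parser finds both trees for the telescope sentence:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pronoun | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pronoun -> 'I'
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree)   # one tree attaches the PP to the verb, the other to "the man"
```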
Bag of Words (BoW): represent a document as a vector of word counts. Simple and effective for classification, but it ignores word order and context entirely.
Advantages: Fast, interpretable, no training required.
Disadvantages: Sparse matrix (99% zeros), negation invisible ("not good" ≈ "good"), synonyms unlinked, order lost.
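A quick sketch with scikit-learn's CountVectorizer; the two example documents differ only by "not", which shows how negation becomes nearly invisible to BoW:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good", "the movie was not good"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)               # sparse document-term matrix

print(vectorizer.get_feature_names_out())        # ['good' 'movie' 'not' 'the' 'was']
print(X.toarray())
# [[1 1 0 1 1]
#  [1 1 1 1 1]]
```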
TF-IDF (Term Frequency–Inverse Document Frequency) fixes BoW's problem of common words dominating. A word scores high if it appears often in THIS document but rarely across ALL documents.
$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{df(t)}$ · $N$: total documents · $df(t)$: documents containing term $t$ · High IDF = rare across corpus = informative
Advantages: Reduces stop word dominance, interpretable, fast (140× faster than BERT).
Disadvantages: Synonyms ignored ("great" ≠ "excellent"), word order lost, blind to polysemy ("bank").
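A sketch with scikit-learn's TfidfVectorizer (note that it uses a smoothed variant of the IDF formula above); the documents are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the quantum entanglement paper",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

vocab = vectorizer.get_feature_names_out()
weights = dict(zip(vocab, X.toarray()[2]))            # weights for the third document
print(sorted(weights.items(), key=lambda kv: -kv[1]))
# 'quantum', 'entanglement', 'paper' (rare) outrank 'the' (present in every document)
```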
Word2Vec: train a neural network to predict context words, then discard the network and keep the learned weights as dense vector representations. Words with similar meanings cluster in vector space.
Limitation: One fixed vector per word → "bank" (financial) = "bank" (riverbank). Polysemy is ignored.
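A toy training run with gensim's Word2Vec; the corpus, vector_size, window, and epoch count are illustrative (real embeddings are trained on millions of sentences):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
] * 50   # repeat the tiny corpus so there is something to learn from

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

print(model.wv["king"].shape)                  # (50,) dense vector
print(model.wv.similarity("king", "queen"))    # words sharing contexts should score higher
```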
Word2Vec gives each word ONE fixed vector. Contextual embeddings compute a different vector for each occurrence based on surrounding context:
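A sketch with the Hugging Face transformers library showing that "bank" receives a different vector in each sentence; the model name and sentences are illustrative, and the exact similarity value depends on the model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank", "We sat on the bank of the river"]
bank_id = tokenizer.convert_tokens_to_ids("bank")
vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    position = inputs["input_ids"][0].tolist().index(bank_id)
    vectors.append(hidden[position])

cos = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"similarity between the two 'bank' vectors: {cos.item():.3f}")   # below 1.0
```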
Every generation solved the previous generation's fundamental limitation:
| Era | Approach | Key Method | Solved / Failed |
|---|---|---|---|
| 1950s–60s | Rule-Based | Hand-written grammar rules | Precise but brittle. Fails on ambiguity and variations. |
| 1980s–90s | Statistical | N-gram language models, HMMs | Handles variation. Fails on long-range dependencies. |
| 2000s | Classic ML | SVM, Naive Bayes, feature engineering | Better generalization. Manual features still needed. |
| 2010s | Deep Learning | RNN/LSTM, word embeddings | Automatic features. Sequential bottleneck. |
| 2017–now | Transformers & LLMs | BERT, GPT, T5, Claude, ChatGPT | Parallel, contextual, scalable. Best performance. |
The through-line: Rules → Counts → Dense vectors → Contextual vectors → World knowledge encoded in billions of parameters. Each era is a better answer to the same question. Increasing complexity → Better understanding of meaning → More resources required.