MOD 04 Introduction to NLP
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Introduction to
Natural Language Processing

Complete course notes — language layers, NLP pipeline (cleaning, tokenization, stemming), structural analysis (POS, NER, parsing), text representation (BoW, TF-IDF, Word2Vec, BERT). Based on ENSAM 2025/2026 lecture PDFs.

Module: 04 of 07
Source: Intro NLP Lecture — ENSAM
Tools: NLTK · spaCy · HuggingFace
01 · Layers of Human Language

Key insight: Natural language is ambiguous at every level. "bank" = financial institution OR riverbank. "I saw the man with the telescope" has two valid parse trees. This is what makes NLP hard.

Layer      | Studies                            | Example                                                        | NLP Connection
Phonology  | Sounds & patterns                  | The /k/ in "cat" vs "chord"                                    | Speech-to-text, audio models
Morphology | Word structure, meaning units      | un-break-able → prefix + root + suffix                         | Tokenization, stemming, lemmatization
Syntax     | Grammar — how words form sentences | "Dog bites man" ≠ "Man bites dog"                              | Parsing, POS tagging, grammar check
Semantics  | Literal meaning of words/sentences | "Colourless green ideas" = syntactically valid but meaningless | Word embeddings, semantic search
Pragmatics | Meaning shaped by context/intent   | "Can you pass the salt?" = request, not a question             | Intent detection, dialogue systems

These layers are hierarchical and interdependent. Early NLP failed by focusing only on lower layers (phonology, syntax) and ignoring higher ones (pragmatics).

02 · What is NLP?

Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human languages — bridging the gap between human communication and computer comprehension. It combines:

Linguistics

How language works — grammar, syntax, semantics, pragmatics. The scientific study of language structure and meaning.

Computer Science

How to process it — algorithms, data structures, parsing, search. The engineering side of handling language computationally.

Machine Learning

How to learn from data — pattern recognition, neural networks, embeddings. Learning language representations from massive corpora.

Historical Evolution

1950s–60s: Rule-based
1980s–90s: Statistical
2000s: Classic ML
2010s: Deep Learning
2017–now: Transformers & LLMs

03 · NLP Applications

Roughly 2.5 quintillion bytes of data are produced every day, much of it unstructured natural-language text. Without NLP, this enormous body of knowledge is invisible to machines.

Application         | Example                                     | NLP Task
Search Engines      | Google parsing "best pizza near me tonight" | Intent detection + location + time extraction
Virtual Assistants  | Siri, Alexa, Google Assistant               | Speech recognition + intent + response generation
Machine Translation | Google Translate, DeepL                     | Seq2Seq modeling, neural translation
Spam Filtering      | Gmail, Outlook                              | Text classification
Sentiment Analysis  | Brand monitoring on social media            | Opinion mining, classification
Medical NLP         | Extracting diagnoses from clinical notes    | NER, relation extraction
Code Generation     | GitHub Copilot, Claude, ChatGPT             | Language modeling, code synthesis
Autocomplete        | Keyboard prediction                         | Language modeling (next word prediction)
04 · NLU vs NLG — Two Directions of Language AI
NLU — Natural Language Understanding

The machine reads/hears text and extracts structured meaning.
Input: messy human language → Output: structured data, labels, decisions.

Tasks: Sentiment Classification, Named Entity Recognition, Intent Detection, Question Answering, Fact Extraction

NLG — Natural Language Generation

The machine produces coherent, fluent text from structured data.
Input: data or prompts → Output: human-readable language.

Tasks: Summarization, Machine Translation, Dialogue Response, Report Generation, Code Generation

Combined Systems: Most modern systems do both. A chatbot uses NLU to parse your question and NLG to write a response. LLMs like GPT and Claude are fundamentally NLG models — but can be prompted to perform NLU tasks.
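A minimal sketch of the two directions, assuming the HuggingFace transformers library; the model choices (the library's default sentiment model and gpt2 for generation) are illustrative, not prescribed by the lecture:

from transformers import pipeline

# NLU: unstructured text in, structured label out
clf = pipeline("sentiment-analysis")             # downloads a default classification model
print(clf("This film was not good."))            # e.g. [{'label': 'NEGATIVE', 'score': ...}]

# NLG: prompt in, fluent text out
gen = pipeline("text-generation", model="gpt2")  # gpt2 chosen for illustration
print(gen("The main goal of NLP is", max_new_tokens=20)[0]["generated_text"])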
Text Processing Pipeline
From Raw Text to Machine-Ready Features
05 · Cleaning & Tokenization
Step 1 — Text Cleaning (6 Steps)

Remove noise before any processing. Each step has a specific purpose — order matters:

# | Step                      | Code                                  | Why
1 | Lowercase                 | text.lower()                          | Reduces vocabulary ("Park" = "park")
2 | Remove HTML tags          | re.sub(r'<[^>]+>', ' ', text)         | Web-scraped data contains markup
3 | Remove URLs               | re.sub(r'http\S+|www\S+', ' ', text)  | URLs are noise for most NLP tasks
4 | Remove punctuation        | re.sub(r'[^\w\s]', ' ', text)         | Punctuation rarely carries meaning
5 | Remove numbers (optional) | re.sub(r'\d+', ' ', text)             | Task-dependent — keep for financial data
6 | Normalize whitespace      | re.sub(r'\s+', ' ', text).strip()     | Steps 2–5 leave extra spaces
import re

def clean_text(text):
    text = text.lower()                           # 1. lowercase
    text = re.sub(r'<[^>]+>', ' ', text)          # 2. strip HTML
    text = re.sub(r'http\S+|www\S+', ' ', text)   # 3. remove URLs
    text = re.sub(r'[^\w\s]', ' ', text)          # 4. remove punctuation
    text = re.sub(r'\d+', ' ', text)              # 5. remove numbers (optional)
    text = re.sub(r'\s+', ' ', text).strip()      # 6. normalize whitespace
    return text

# Example:
clean_text("<p>I LOVED this film!!! Visit http://imdb.com</p>")
# → "i loved this film visit"
Step 2 — Tokenization

Split text into individual units (tokens). Modern LLMs use subword tokenization, such as BPE (Byte Pair Encoding) in GPT-style models or WordPiece in BERT, to handle rare and unknown words.

Word tokenization: ["the", "cats", "are", "running", "quickly", "in", "park"]
Subword split:     "unbelievable" → ["un", "believ", "able"]

# NLTK
import nltk; nltk.download('punkt')
tokens = nltk.word_tokenize(text)

# HuggingFace (BERT WordPiece subwords)
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tok.tokenize(text)

Libraries: NLTK · spaCy · HuggingFace Transformers
06 · Stop Words, Stemming & Lemmatization
Step 3 — Stop Word Removal

Remove high-frequency, low-information words: "the", "is", "a", "in", "are" — focus on content-bearing tokens.

Input tokens:  ["the", "cats", "are", "running", "quickly", "in", "park"]
After removal: ["cats", "running", "quickly", "park"]

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
filtered = [w for w in tokens if w not in stop]
Critical warning — do NOT blindly remove all stop words: negation words such as "not" and "no" are in NLTK's default list (and "never" appears in other libraries' lists), yet they are essential for negation. Removing "not" turns "This film was not good" → "This film was good" — the opposite sentiment. Always evaluate which stop words are relevant to your task before removing them (see the sketch below).
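A minimal task-aware sketch for sentiment analysis (the set of negation words kept here is an illustrative choice, not an official list):

from nltk.corpus import stopwords

# Keep negation words out of the stop list for sentiment tasks
negations = {"not", "no", "never", "nor"}
stop = set(stopwords.words('english')) - negations

tokens = ["this", "film", "was", "not", "good"]
filtered = [w for w in tokens if w not in stop]
# → ["film", "not", "good"]   (negation preserved)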
Step 4 — Stemming & Lemmatization

Reduce words to their base form so "run", "runs", "running", "ran" are treated as the same concept.

Word    | Stemming       | Lemmatization
running | run            | run
studies | studi (crude!) | study (correct)
better  | better         | good (uses POS)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmer.stem("running")                  # → "run"
lemmatizer.lemmatize("better", pos="a")  # → "good"

# spaCy (faster and more accurate)
doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_)      # lemma_ attribute

Stemming: Fast, crude — chops off suffixes (may produce non-words like "studi").
Lemmatization: Slower, accurate — uses vocabulary and POS context to find the proper base form.

Structural Analysis
Grammar, Entities, and Relationships
07 · POS Tagging & Named Entity Recognition
POS — Part-of-Speech Tagging

Assign a grammatical role to each token. The same word can be different parts of speech depending on usage.

"The quick brown fox jumps over the lazy dog" The→DET quick→ADJ brown→ADJ fox→NOUN jumps→VERB over→PREP Ambiguity: "I can fish" → can[VERB] fish[VERB] "A tin of fish" → fish[NOUN] import spacy nlp = spacy.load("en_core_web_sm") doc = nlp(text) for token in doc: print(token.text, token.pos_, token.tag_)
NER — Named Entity Recognition

Identify and classify specific real-world objects with proper names. Critical for knowledge graphs and information extraction.

"Apple was founded by Steve Jobs in Cupertino in 1976." Apple → ORG Steve Jobs → PERSON Cupertino → LOC 1976 → DATE Entity types: PERSON, ORG, GPE/LOC, DATE, MONEY, PRODUCT, EVENT, LAW doc = nlp(text) for ent in doc.ents: print(ent.text, ent.label_) # HuggingFace pipeline from transformers import pipeline ner = pipeline("ner", model="dslim/bert-base-NER") result = ner(text)
08 · Dependency Parsing & Context-Free Grammar
Dependency Parsing

Analyze grammatical relationships between words — which word is the subject, object, modifier of which other word. Answers: who did what to whom?

"The dog bit the man" bit → ROOT dog → nsubj (subject of "bit") man → dobj (direct object of "bit") The → det (determiner of "dog") the → det (determiner of "man") for token in doc: print(token.text, token.dep_, token.head.text)
CFG — Context-Free Grammar

A formal system for defining language as production rules. Foundation of early NLP parsers — reveals WHY language is so hard to parse.

S  → NP VP
NP → DET NOUN | DET ADJ NOUN
VP → VERB NP | VERB

The Ambiguity Problem: "I saw the man with the telescope"
Parse 1: I [saw [the man with the telescope]]   → the man had the telescope
Parse 2: I [saw [the man] [with the telescope]] → I used the telescope

CFGs cannot resolve such ambiguities — natural language regularly violates CFG assumptions. This drove the move to probabilistic and neural approaches.
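To see both trees programmatically, here is a minimal sketch with NLTK's chart parser; the toy grammar below (pronouns and prepositional phrases added to the rules above) is an illustrative assumption, not the lecture's grammar:

import nltk

grammar = nltk.CFG.fromstring("""
S    -> NP VP
NP   -> 'I' | DET NOUN | NP PP
VP   -> VERB NP | VP PP
PP   -> PREP NP
DET  -> 'the'
NOUN -> 'man' | 'telescope'
VERB -> 'saw'
PREP -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with the telescope".split()):
    tree.pretty_print()   # prints two distinct trees: the PP attaches to the NP or to the VP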

Text Representation
Algorithms Need Numbers, Not Words
09 · Bag of Words & TF-IDF
BoW — Bag of Words

Represent a document as a vector of word counts. Simple and effective for classification, but ignores word order and context entirely.

Vocabulary: [cat, dog, the, loves]
"The cat loves the dog" → [1, 1, 2, 1]
"The dog loves the cat" → [1, 1, 2, 1]   ← SAME VECTOR! Different meaning!

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(corpus)

Advantages: Fast, interpretable, no training required.
Disadvantages: Sparse matrix (99% zeros), negation invisible ("not good" ≈ "good"), synonyms unlinked, order lost.

TF-IDF — Term Frequency–Inverse Document Frequency

Fixes BoW's problem of common words dominating. A word scores high if it appears often in THIS document but rarely across ALL documents.

TF-IDF formula:
$$\text{TF}(t,d) = \frac{\text{count}(t, d)}{|d|} \qquad \text{IDF}(t) = \log\!\left(\frac{N}{df(t)}\right) + 1$$
$$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$$

$N$: total documents · $df(t)$: documents containing term $t$ · High IDF = rare across corpus = informative
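A quick worked example with made-up numbers (a hypothetical corpus of $N = 1000$ documents, a 100-word review, natural logarithm):

$$\text{"brilliant"}:\ \text{TF} = \tfrac{2}{100} = 0.02,\quad \text{IDF} = \log\!\tfrac{1000}{5} + 1 \approx 6.3,\quad \text{TF-IDF} \approx 0.13$$
$$\text{"the"}:\ \text{TF} = \tfrac{6}{100} = 0.06,\quad \text{IDF} = \log\!\tfrac{1000}{1000} + 1 = 1,\quad \text{TF-IDF} = 0.06$$

Although "the" occurs three times more often in the review, its weight ends up less than half that of "brilliant".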

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(max_features=5000, ngram_range=(1,2))  # unigrams + bigrams
X = vect.fit_transform(corpus)

Example word weights (IDF):
"the", "is", "in"          → IDF ≈ 1.1 → nearly ignored (noise)
"brilliant", "masterpiece" → high IDF  → strong positive signal
"terrible", "awful"        → high IDF  → strong negative signal

Advantages: Reduces stop word dominance, interpretable, fast (140× faster than BERT).
Disadvantages: Synonyms ignored ("great" ≠ "excellent"), word order lost, blind to polysemy ("bank").

10 · Word Embeddings & Contextual Embeddings
Word2Vec / GloVe — Static Embeddings

Train a neural network to predict context words. Discard the network; keep learned weights as dense vector representations. Words with similar meanings cluster in vector space.

Famous vector arithmetic:
king − man + woman ≈ queen       (gender relationship)
France → Paris ≈ Italy → Rome    (capital relationship)

Dimension: 300 floats per word (vs. 50,000+ for BoW sparse vectors)

import gensim.downloader
model = gensim.downloader.load('word2vec-google-news-300')
vec = model['king']
model.most_similar(positive=['king', 'woman'], negative=['man'])  # → queen

Limitation: One fixed vector per word → "bank" (financial) = "bank" (riverbank). Polysemy is ignored.

BERT / GPT — Contextual Embeddings

Word2Vec gives each word ONE fixed vector. Contextual embeddings compute a different vector for each occurrence based on surrounding context:

"I deposited money at the bank" → bank = [financial institution vector] "We sat by the river bank" → bank = [riverbank vector] Same token — completely different embedding! from transformers import AutoTokenizer, AutoModel import torch tok = AutoTokenizer.from_pretrained('bert-base-uncased') model = AutoModel.from_pretrained('bert-base-uncased') inputs = tok(sentence, return_tensors='pt') with torch.no_grad(): outputs = model(**inputs) embeddings = outputs.last_hidden_state # shape: [batch, seq_len, 768]
11 · Evolution of NLP Models

Every generation solved the previous generation's fundamental limitation:

Era       | Approach            | Key Method                            | Solved / Failed
1950s–60s | Rule-Based          | Hand-written grammar rules            | Precise but brittle. Fails on ambiguity and variations.
1980s–90s | Statistical         | N-gram language models, HMMs          | Handles variation. Fails on long-range dependencies.
2000s     | Classic ML          | SVM, Naive Bayes, feature engineering | Better generalization. Manual features still needed.
2010s     | Deep Learning       | RNN/LSTM, word embeddings             | Automatic features. Sequential bottleneck.
2017–now  | Transformers & LLMs | BERT, GPT, T5, Claude, ChatGPT        | Parallel, contextual, scalable. Best performance.

The through-line: Rules → Counts → Dense vectors → Contextual vectors → World knowledge encoded in billions of parameters. Each era is a better answer to the same question. Increasing complexity → Better understanding of meaning → More resources required.