Introduction to Natural Language Processing
Complete course notes — language layers, NLP pipeline (cleaning, tokenization, stemming), structural analysis (POS, NER, parsing), text representation (BoW, TF-IDF, Word2Vec, BERT). Based on ENSAM 2025/2026 lecture PDFs.
Key insight: Natural language is ambiguous at every level. "bank" = financial institution OR riverbank. "I saw the man with the telescope" has two valid parse trees. This is what makes NLP hard.
| Layer | Studies | Example | NLP Connection |
|---|---|---|---|
| Phonology | Sounds & patterns | The /k/ in "cat" vs "chord" | Speech-to-text, audio models |
| Morphology | Word structure, meaning units | un-break-able → prefix + root + suffix | Tokenization, stemming, lemmatization |
| Syntax | Grammar — how words form sentences | "Dog bites man" ≠ "Man bites dog" | Parsing, POS tagging, grammar check |
| Semantics | Literal meaning of words/sentences | "Colourless green ideas" = syntactically valid but meaningless | Word embeddings, semantic search |
| Pragmatics | Meaning shaped by context/intent | "Can you pass the salt?" = request, not a question | Intent detection, dialogue systems |
These layers are hierarchical and interdependent. Early NLP failed by focusing only on lower layers (phonology, syntax) and ignoring higher ones (pragmatics).
Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human languages — bridging the gap between human communication and computer comprehension. It combines:
Linguistics: how language works — grammar, syntax, semantics, pragmatics. The scientific study of language structure and meaning.
Computer science: how to process it — algorithms, data structures, parsing, search. The engineering side of handling language computationally.
Machine learning: how to learn from data — pattern recognition, neural networks, embeddings. Learning language representations from massive corpora.
1950s–60s: Rule-based
1980s–90s: Statistical
2000s: Classic ML
2010s: Deep Learning
2017–now: Transformers & LLMs
2.5 quintillion bytes of data produced daily — the vast majority is natural language text. Without NLP, this enormous body of knowledge is invisible to machines.
| Application | Example | NLP Task |
|---|---|---|
| Search Engines | Google parsing "best pizza near me tonight" | Intent detection + location + time extraction |
| Virtual Assistants | Siri, Alexa, Google Assistant | Speech recognition + intent + response generation |
| Machine Translation | Google Translate, DeepL | Seq2Seq modeling, neural translation |
| Spam Filtering | Gmail, Outlook | Text classification |
| Sentiment Analysis | Brand monitoring on social media | Opinion mining, classification |
| Medical NLP | Extracting diagnoses from clinical notes | NER, relation extraction |
| Code Generation | GitHub Copilot, Claude, ChatGPT | Language modeling, code synthesis |
| Autocomplete | Keyboard prediction | Language modeling (next word prediction) |
Natural Language Understanding (NLU): the machine reads/hears text and extracts structured meaning.
Input: messy human language → Output: structured data, labels, decisions.
Tasks: Sentiment Classification, Named Entity Recognition, Intent Detection, Question Answering, Fact Extraction
Natural Language Generation (NLG): the machine produces coherent, fluent text from structured data.
Input: data or prompts → Output: human-readable language.
Tasks: Summarization, Machine Translation, Dialogue Response, Report Generation, Code Generation
Remove noise before any processing. Each step has a specific purpose — order matters:
| # | Step | Code | Why |
|---|---|---|---|
| 1 | Lowercase | text.lower() | Reduces vocabulary ("Park" = "park") |
| 2 | Remove HTML tags | re.sub(r'<[^>]+>', ' ', text) | Web-scraped data contains markup |
| 3 | Remove URLs | re.sub(r'http\S+|www\S+', ' ', text) | URLs are noise for most NLP tasks |
| 4 | Remove punctuation | re.sub(r'[^\w\s]', ' ', text) | Punctuation rarely carries meaning |
| 5 | Remove numbers (optional) | re.sub(r'\d+', ' ', text) | Task-dependent — keep for financial data |
| 6 | Normalize whitespace | re.sub(r'\s+', ' ', text).strip() | Steps 2–5 leave extra spaces |
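A rough sketch of the six steps above as one Python function; the function name, the remove_numbers flag, and the sample string are illustrative choices, not prescribed by the course:

```python
import re

def clean_text(text: str, remove_numbers: bool = False) -> str:
    text = text.lower()                           # 1. lowercase
    text = re.sub(r'<[^>]+>', ' ', text)          # 2. strip HTML tags
    text = re.sub(r'http\S+|www\S+', ' ', text)   # 3. strip URLs
    text = re.sub(r'[^\w\s]', ' ', text)          # 4. strip punctuation
    if remove_numbers:                            # 5. optional, task-dependent
        text = re.sub(r'\d+', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()      # 6. normalize whitespace

print(clean_text("<p>Visit https://example.com, it's the BEST!!</p>"))
# -> "visit it s the best"
```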
Split text into individual units (tokens). Modern LLMs use subword tokenization (BPE — Byte Pair Encoding) to handle rare and unknown words.
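As a rough illustration of how BPE builds a subword vocabulary, the sketch below repeatedly merges the most frequent adjacent pair of symbols; the toy corpus and the number of merge steps are made up for demonstration:

```python
from collections import Counter

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}   # word -> frequency
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

def most_frequent_pair(vocab):
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

for step in range(6):                             # 6 merge operations
    a, b = most_frequent_pair(vocab)
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    vocab = merged
    print(f"merge {step + 1}: {a} + {b} -> {a + b}")
# Rare or unseen words end up split into learned pieces, so nothing is "unknown".
```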
Remove high-frequency, low-information words: "the", "is", "a", "in", "are" — focus on content-bearing tokens.
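A tiny filtering sketch; the stop-word set below is a small illustrative subset (libraries such as NLTK and spaCy ship full stop-word lists):

```python
# Keep only content-bearing tokens by dropping known stop words.
stop_words = {"the", "is", "a", "in", "are", "of", "to"}

tokens = ["the", "cat", "is", "in", "the", "garden"]
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)   # ['cat', 'garden']
```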
Reduce words to their base form so "run", "runs", "running", "ran" are treated as the same concept.
| Word | Stemming | Lemmatization |
|---|---|---|
| running | run | run |
| studies | studi (crude!) | study (correct) |
| better | better | good (uses POS) |
Stemming: Fast, crude — chops off suffixes (may produce non-words like "studi").
Lemmatization: Slower, accurate — uses vocabulary and POS context to find the proper base form.
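One way to reproduce the table above is with NLTK's PorterStemmer and WordNetLemmatizer (assumes nltk is installed and the wordnet resource has been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "better"]:
    print(word, "->", stemmer.stem(word))          # run, studi, better

print(lemmatizer.lemmatize("studies"))             # study
print(lemmatizer.lemmatize("better", pos="a"))     # good (needs the adjective POS tag)
print(lemmatizer.lemmatize("running", pos="v"))    # run
```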
Assign a grammatical role to each token. The same word can be different parts of speech depending on usage.
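A possible illustration with NLTK's tagger (requires the punkt and averaged_perceptron_tagger resources); the sentence is chosen so that "book" appears once as a verb and once as a noun:

```python
import nltk

tokens = nltk.word_tokenize("Will you book a flight so I can read my book?")
print(nltk.pos_tag(tokens))
# "book" should be tagged as a verb (VB) the first time and a noun (NN) the second
```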
Identify and classify specific real-world objects with proper names. Critical for knowledge graphs and information extraction.
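A sketch of named entity recognition using spaCy as one common tool (assumes the en_core_web_sm model has been downloaded); the example sentence and the labels in the comment are illustrative, not guaranteed output:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired a startup in Paris for $2 billion in 2024.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected: Apple -> ORG, Paris -> GPE, $2 billion -> MONEY, 2024 -> DATE
```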
Analyze grammatical relationships between words — which word is the subject, object, modifier of which other word. Answers: who did what to whom?
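The same spaCy model also produces a dependency parse; each token points to its grammatical head, which is what answers "who did what to whom":

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog bit the postman")
for token in doc:
    print(f"{token.text:<8} {token.dep_:<8} head: {token.head.text}")
# 'dog' should come out as nsubj (subject) of 'bit' and 'postman' as dobj (object)
```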
A Context-Free Grammar (CFG) is a formal system that defines a language as a set of production rules. CFGs are the foundation of early NLP parsers — and they reveal WHY natural language is so hard to parse.
CFGs cannot resolve ambiguities like the telescope sentence from the introduction: both parses obey the grammar, and nothing in the rules says which reading is intended. Natural language regularly violates CFG assumptions, which drove the move to probabilistic and neural approaches.
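A hand-written toy grammar makes the point concrete (the rules below are illustrative, not from the lecture); NLTK's chart parser finds both trees for the telescope sentence:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pronoun | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pronoun -> 'I'
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree)   # one tree attaches the PP to the verb, the other to "the man"
```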
Bag of Words (BoW): represent a document as a vector of word counts. Simple and effective for classification, but it ignores word order and context entirely.
Advantages: Fast, interpretable, no training required.
Disadvantages: Sparse matrix (99% zeros), negation invisible ("not good" ≈ "good"), synonyms unlinked, order lost.
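A quick sketch with scikit-learn's CountVectorizer; the two example documents differ only by "not", which shows how negation becomes nearly invisible to BoW:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good", "the movie was not good"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)               # sparse document-term matrix

print(vectorizer.get_feature_names_out())        # ['good' 'movie' 'not' 'the' 'was']
print(X.toarray())
# [[1 1 0 1 1]
#  [1 1 1 1 1]]
```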
TF-IDF (Term Frequency–Inverse Document Frequency) fixes BoW's problem of common words dominating. A word scores high if it appears often in THIS document but rarely across ALL documents.
$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{df(t)}$ · $N$: total documents · $df(t)$: documents containing term $t$ · High IDF = rare across corpus = informative
Advantages: Reduces stop word dominance, interpretable, fast (140× faster than BERT).
Disadvantages: Synonyms ignored ("great" ≠ "excellent"), word order lost, blind to polysemy ("bank").
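A sketch with scikit-learn's TfidfVectorizer (note that it uses a smoothed variant of the IDF formula above); the documents are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the quantum entanglement paper",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

vocab = vectorizer.get_feature_names_out()
weights = dict(zip(vocab, X.toarray()[2]))            # weights for the third document
print(sorted(weights.items(), key=lambda kv: -kv[1]))
# 'quantum', 'entanglement', 'paper' (rare) outrank 'the' (present in every document)
```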
Word2Vec: train a neural network to predict context words, then discard the network and keep the learned weights as dense vector representations. Words with similar meanings cluster in vector space.
Limitation: One fixed vector per word → "bank" (financial) = "bank" (riverbank). Polysemy is ignored.
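A toy training run with gensim's Word2Vec; the corpus, vector_size, window, and epoch count are illustrative (real embeddings are trained on millions of sentences):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
] * 50   # repeat the tiny corpus so there is something to learn from

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

print(model.wv["king"].shape)                  # (50,) dense vector
print(model.wv.similarity("king", "queen"))    # words sharing contexts should score higher
```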
Word2Vec gives each word ONE fixed vector. Contextual embeddings compute a different vector for each occurrence based on surrounding context:
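A sketch with the Hugging Face transformers library showing that "bank" receives a different vector in each sentence; the model name and sentences are illustrative, and the exact similarity value depends on the model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank", "We sat on the bank of the river"]
bank_id = tokenizer.convert_tokens_to_ids("bank")
vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    position = inputs["input_ids"][0].tolist().index(bank_id)
    vectors.append(hidden[position])

cos = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"similarity between the two 'bank' vectors: {cos.item():.3f}")   # below 1.0
```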
Every generation solved the previous generation's fundamental limitation:
| Era | Approach | Key Method | Solved / Failed |
|---|---|---|---|
| 1950s–60s | Rule-Based | Hand-written grammar rules | Precise but brittle. Fails on ambiguity and variations. |
| 1980s–90s | Statistical | N-gram language models, HMMs | Handles variation. Fails on long-range dependencies. |
| 2000s | Classic ML | SVM, Naive Bayes, feature engineering | Better generalization. Manual features still needed. |
| 2010s | Deep Learning | RNN/LSTM, word embeddings | Automatic features. Sequential bottleneck. |
| 2017–now | Transformers & LLMs | BERT, GPT, T5, Claude, ChatGPT | Parallel, contextual, scalable. Best performance. |
The through-line: Rules → Counts → Dense vectors → Contextual vectors → World knowledge encoded in billions of parameters. Each era is a better answer to the same question. Increasing complexity → Better understanding of meaning → More resources required.