Natural Language Processing (NLP) is the field of AI concerned with enabling computers to understand, generate, and reason about human language. NLP encompasses tasks from basic text preprocessing to sophisticated language understanding and generation — spanning sentiment analysis, machine translation, question answering, summarization, and conversational AI.
Core NLP tasks
NLP covers a wide spectrum of tasks — all ultimately about enabling machines to work with human language:
| Task | Description | Example | Models |
|---|---|---|---|
| Text classification | Assign category label to text | Spam/not-spam, sentiment (pos/neg) | BERT, DistilBERT, SetFit |
| Named Entity Recognition (NER) | Tag spans with entity type | "Apple [ORG] was founded by Steve Jobs [PER]" | BERT + token classifier, spaCy |
| Machine translation | Translate text from one language to another | "Hello" → "Bonjour" | NLLB-200, Helsinki-NLP OPUS-MT, DeepL |
| Summarization | Compress document to key points | Extractive or abstractive | BART, PEGASUS, GPT-4 |
| Question answering | Answer questions from context/knowledge | SQuAD (span extraction), open-domain QA | BERT, RAG, LLMs |
| Relation extraction | Find relationships between entities | "Elon Musk FOUNDED Tesla" | BERT + classification head |
| Coreference resolution | Link pronouns to referents | "The CEO said she..." → she = CEO | SpanBERT, neural coref |
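As a minimal illustration of the first task in the table, text classification can be reduced to a toy multinomial Naive Bayes over word counts. This is a pure-Python sketch of the classical statistical approach (the class name, training examples, and labels below are all made up for illustration); a real system would use a BERT-style model or at least scikit-learn:

```python
import math
from collections import Counter, defaultdict

class ToyNaiveBayes:
    """Multinomial Naive Bayes over raw word counts (illustrative only)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # per-label word frequencies
        self.label_counts = Counter(labels)
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        scores = {}
        total_docs = sum(self.label_counts.values())
        for y, n_y in self.label_counts.items():
            score = math.log(n_y / total_docs)    # log prior
            total = sum(self.word_counts[y].values())
            for w in doc.lower().split():
                # Laplace smoothing: unseen words don't zero out a class
                score += math.log((self.word_counts[y][w] + 1)
                                  / (total + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

# tiny made-up training set: spam vs. ham
nb = ToyNaiveBayes().fit(
    ["win free money now", "free prize claim now",
     "meeting at noon", "lunch at noon tomorrow"],
    ["spam", "spam", "ham", "ham"],
)
```

Despite its naive independence assumption, this kind of model was a strong text-classification baseline throughout the statistical-NLP era.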
The NLP preprocessing pipeline
Classical NLP required extensive preprocessing before modeling. Modern neural NLP mostly bypasses this — but understanding it helps interpret legacy systems:
Classical NLP preprocessing vs modern neural approach:

```python
# ── CLASSICAL APPROACH (needed for TF-IDF, n-grams, SVM) ──────────────────
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# One-time setup: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet')

def classical_preprocess(text: str) -> list[str]:
    text = text.lower()                                   # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)               # remove punctuation
    tokens = word_tokenize(text)                          # tokenize
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]   # remove stopwords
    lem = WordNetLemmatizer()
    return [lem.lemmatize(t) for t in tokens]             # lemmatize: running → run

# ── MODERN NEURAL APPROACH (BERT/LLM tokenizers) ──────────────────────────
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def neural_preprocess(text: str) -> dict:
    # No manual preprocessing needed — the tokenizer handles everything.
    # Uses WordPiece/BPE subword tokenization:
    # "running" → ["run", "##ning"] — keeps morphological structure
    return tokenizer(text, truncation=True, max_length=512,
                     return_tensors="pt")

# Neural models learn from raw text — stopwords and casing carry information
```

Why neural NLP skips preprocessing
Removing stopwords discards context ("not", "no", "never" are stopwords but completely change meaning). Lowercasing loses NER signals ("Apple" vs "apple"). Stemming conflates different senses ("better" → "good" loses comparison). Subword tokenizers handle all cases naturally without human rules.
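The subword behaviour described above can be sketched with BERT's greedy longest-match-first WordPiece algorithm over a hypothetical toy vocabulary (real WordPiece vocabularies have roughly 30k entries and are learned from data, not hand-written):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, as in BERT's WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation-piece marker
            if sub in vocab:
                piece = sub               # longest matching piece found
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]              # no vocabulary piece matched
        tokens.append(piece)
        start = end
    return tokens

# toy vocabulary for illustration only
vocab = {"run", "##ning", "jump", "##s", "the"}
```

With this vocabulary, `wordpiece_tokenize("running", vocab)` yields `["run", "##ning"]`: the morphological split falls out of the vocabulary rather than from hand-written stemming rules.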
The evolution: from bag-of-words to LLMs
| Era | Paradigm | Key methods | Limitation overcome / introduced |
|---|---|---|---|
| 1960s–80s | Rule-based | Hand-written grammars, lexicons, ELIZA | No learning — brittle, can't generalize |
| 1990s–2000s | Statistical NLP | TF-IDF, n-grams, HMMs, Naive Bayes | Learned from data — but bag-of-words, no word order |
| 2013 | Word embeddings | Word2Vec, GloVe | Semantic similarity captured — but context-free (one vector per word) |
| 2014–17 | Neural NLP | LSTMs, CNNs for text | Learned features — but sequential, slow to train |
| 2018 | Transfer learning | BERT (MLM), GPT (LM) | Pretrain once, fine-tune everywhere — transformed NLP benchmarks |
| 2020–present | Foundation models | GPT-3, ChatGPT, LLaMA, Claude | In-context learning, instruction following, emergent capabilities at scale |
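The "context-free" limitation of the 2013 word-embedding era can be seen directly with cosine similarity over static vectors. The 3-d vectors below are made-up illustrative values, not trained embeddings; the point is structural: "bank" gets exactly one vector, so its financial and river senses are collapsed into it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up 3-d vectors for illustration (not trained embeddings)
emb = {
    "bank":  [0.7, 0.1, 0.6],   # ONE vector for both senses of "bank"
    "money": [0.9, 0.0, 0.1],
    "river": [0.1, 0.1, 0.9],
}
```

Here `cosine(emb["bank"], emb["money"])` and `cosine(emb["bank"], emb["river"])` are both high: a single static vector cannot pick a sense per context, which is exactly what contextual models (ELMo, BERT) later fixed.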
Linguistic complexity: what AI must handle
Natural language is deceptively complex. Here are the core challenges, with examples of where even frontier LLMs still fail:
| Challenge | Example | Why it's hard |
|---|---|---|
| Lexical ambiguity | "The bank was steep" (river? financial?) | Same word, multiple meanings — requires context |
| Structural ambiguity | "I saw the man with the telescope" | Two valid parse trees — only world knowledge resolves |
| Coreference | "The trophy didn't fit because it was too big" | What is "it"? Requires common sense (Winograd schema) |
| Implicit knowledge | "I need a plumber. The sink is overflowing." | Causal link requires world knowledge not in text |
| Pragmatics | "Can you pass the salt?" | Request, not ability question — speech act theory |
| Negation scope | "The patient has no fever or chills" | Negation spans both — critical in clinical NLP |
| Sarcasm/irony | "Oh great, another Monday" | Literal meaning opposite to intended — tone-dependent |
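The negation-scope row is exactly what rule-based clinical systems such as NegEx approximate. Below is a drastically simplified sketch of that idea; the trigger list and window size are illustrative choices, not the published NegEx rules:

```python
NEG_TRIGGERS = {"no", "not", "never", "without", "denies"}

def negated_terms(sentence, window=5):
    """Flag tokens within `window` words after a negation trigger
    (heavily simplified NegEx-style rule; illustrative only)."""
    tokens = sentence.lower().split()
    negated = set()
    for i, tok in enumerate(tokens):
        if tok in NEG_TRIGGERS:
            # everything in the window after the trigger is in scope
            negated.update(tokens[i + 1 : i + 1 + window])
    return negated
```

On "The patient has no fever or chills", the single trigger "no" correctly scopes over both symptoms, but a fixed window fails on longer coordinated phrases, which is why neural models replaced these rules.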
LLM failure modes on linguistics
LLMs still fail systematically on: negation in complex sentences, Winograd-schema coreference that requires specific world knowledge, long-range syntactic agreement, and multi-hop reasoning that requires combining implicit facts from separate parts of a document.
Evaluation metrics for NLP
| Metric | Measures | Used for | Limitation |
|---|---|---|---|
| BLEU | n-gram precision vs reference | Machine translation | Doesn't capture meaning — low correlation with human judgment |
| ROUGE-L | Longest common subsequence recall | Summarization | Rewards extractive, penalizes creative paraphrase |
| METEOR | Precision + recall + alignment (synonyms) | Translation, summarization | Slower; synonym matching is language-dependent |
| BERTScore | Semantic similarity via BERT embeddings | Generation quality | Better than BLEU but can miss factual errors |
| Perplexity | Model surprise on test text (↓ = better) | Language model quality | Low perplexity doesn't guarantee downstream task quality |
| MMLU / HellaSwag | Multiple-choice knowledge/reasoning | LLM benchmarking | Saturated — frontier models score > 90% |
| Human evaluation | Fluency, factuality, helpfulness (gold standard) | Final model assessment | Expensive, slow, inter-annotator disagreement |
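BLEU's clipped n-gram precision idea can be sketched in a few lines. This is a simplified single-reference variant up to bigrams; real BLEU uses up to 4-grams, multiple references, and corpus-level statistics (in practice, use a standard implementation such as sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU with uniform n-gram weights."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # clipped counts: a candidate n-gram only scores up to its reference count
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: punish candidates shorter than the reference
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * geo_mean
```

The table's limitation is visible here: the score only counts surface n-gram overlap, so a perfect paraphrase with different words scores zero.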
Modern evaluation practice
BLEU/ROUGE are declining in use as LLM-as-judge evaluations (GPT-4, Claude scoring outputs on rubrics) correlate better with human preferences. MT-Bench, AlpacaEval, and LMSYS Chatbot Arena are the dominant LLM evaluation frameworks in 2025.
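Perplexity from the metrics table, by contrast, is mechanically simple: it is the exponentiated average negative log-probability the model assigns to held-out tokens. A minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# a model assigning probability 0.25 to every token has perplexity ≈ 4,
# i.e. it is "as confused" as a uniform choice among 4 tokens
ppl = perplexity([math.log(0.25)] * 10)
```

The intuition: perplexity ≈ k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step.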
Practice questions
- What is the pipeline difference between NLP in the 2010s (pre-deep learning) and NLP in 2024? (Answer: 2010s NLP: pipeline of specialised components — tokeniser → POS tagger → NER → parser → coreference resolver → task-specific classifier. Each component was separately trained and errors accumulated through the pipeline. 2024 NLP: a single large language model handles tokenisation, understanding, generation, and all downstream tasks through prompting or fine-tuning. One model replaces the entire pipeline, handles task combinations naturally, and achieves better performance on most tasks.)
- What is the Chomsky hierarchy and how do neural LLMs relate to formal language theory? (Answer: Chomsky hierarchy: Regular < Context-Free < Context-Sensitive < Recursively Enumerable languages. Regular: finite automata (regex). Context-Free: pushdown automata (most programming languages). Natural language: roughly context-sensitive (agreement, cross-serial dependencies). Neural LLMs: empirically capable of generating text that follows complex natural language patterns, including context-sensitive phenomena. However, LLMs are not proven to recognise formal languages beyond their training distribution — they approximate statistical patterns rather than parsing formal grammars.)
- What is the difference between extractive and abstractive NLP tasks? (Answer: Extractive: the output is a subset of the input — NER (extracting entity spans), extractive QA (finding answer span), extractive summarisation (selecting key sentences). Simpler, more reliable, no hallucination risk. Abstractive: the output requires generating new text not present in the input — abstractive summarisation (paraphrasing), translation, abstractive QA (reasoning across documents). More flexible, more natural-sounding but risk generating incorrect content not supported by the source.)
- What is the Turing Test and why is passing it no longer considered sufficient evidence of general language understanding? (Answer: Turing Test (Turing 1950): if a human interrogator cannot distinguish a machine from a human in text conversation, the machine exhibits intelligence. Modern LLMs (GPT-4, Claude) consistently fool human judges in short conversations. However, they still: hallucinate facts, fail basic physical reasoning, lack consistent world models, and cannot reliably solve novel logical puzzles. Passing the Turing Test by producing human-like text is a weaker criterion than general language understanding — it tests conversational fluency, not reasoning.)
- What is the 'bitter lesson' (Rich Sutton) and how has it shaped modern NLP? (Answer: The bitter lesson (Sutton 2019): AI progress consistently comes from leveraging computation rather than incorporating human domain knowledge. Hand-crafted features, linguistic rules, and expert-designed NLP pipelines were repeatedly outperformed by general-purpose methods (neural networks) given enough data and compute. The lesson: invest in methods that scale with compute. Modern NLP took this lesson: transformers + massive data + scale replaced hand-crafted NLP pipelines. The 'bitter' part: human expertise in linguistics ultimately mattered less than raw computation.)