Natural Language Processing (NLP) is the field of AI concerned with enabling computers to understand, generate, and reason about human language. NLP encompasses tasks from basic text preprocessing to sophisticated language understanding and generation — spanning sentiment analysis, machine translation, question answering, summarization, and conversational AI.
Core NLP tasks
NLP covers a wide spectrum of tasks — all ultimately about enabling machines to work with human language:
| Task | Description | Example | Models |
|---|---|---|---|
| Text classification | Assign category label to text | Spam/not-spam, sentiment (pos/neg) | BERT, DistilBERT, SetFit |
| Named Entity Recognition (NER) | Tag spans with entity type | "Apple [ORG] was founded by Steve Jobs [PER]" | BERT + token classifier, spaCy |
| Machine translation | Translate text from one language to another | "Hello" → "Bonjour" | NLLB-200, Helsinki-NLP OPUS-MT, DeepL |
| Summarization | Compress document to key points | Extractive or abstractive | BART, PEGASUS, GPT-4 |
| Question answering | Answer questions from context/knowledge | SQuAD (span extraction), open-domain QA | BERT, RAG, LLMs |
| Relation extraction | Find relationships between entities | "Elon Musk FOUNDED Tesla" | BERT + classification head |
| Coreference resolution | Link pronouns to referents | "The CEO said she..." → she = CEO | SpanBERT, neural coref |
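As a minimal illustration of the first task in the table, text classification can be reduced to a toy multinomial Naive Bayes over word counts. This is a pure-Python sketch of the classical statistical approach (the class name, training examples, and labels below are all made up for illustration); a real system would use a BERT-style model or at least scikit-learn:

```python
import math
from collections import Counter, defaultdict

class ToyNaiveBayes:
    """Multinomial Naive Bayes over raw word counts (illustrative only)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # per-label word frequencies
        self.label_counts = Counter(labels)
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        scores = {}
        total_docs = sum(self.label_counts.values())
        for y, n_y in self.label_counts.items():
            score = math.log(n_y / total_docs)    # log prior
            total = sum(self.word_counts[y].values())
            for w in doc.lower().split():
                # Laplace smoothing: unseen words don't zero out a class
                score += math.log((self.word_counts[y][w] + 1)
                                  / (total + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

# tiny made-up training set: spam vs. ham
nb = ToyNaiveBayes().fit(
    ["win free money now", "free prize claim now",
     "meeting at noon", "lunch at noon tomorrow"],
    ["spam", "spam", "ham", "ham"],
)
```

Despite its naive independence assumption, this kind of model was a strong text-classification baseline throughout the statistical-NLP era.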
The NLP preprocessing pipeline
Classical NLP required extensive preprocessing before modeling. Modern neural NLP mostly bypasses this — but understanding it helps interpret legacy systems:
Classical NLP preprocessing vs modern neural approach:

```python
# ── CLASSICAL APPROACH (needed for TF-IDF, n-grams, SVM) ──────────────────
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# One-time setup: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet')

def classical_preprocess(text: str) -> list[str]:
    text = text.lower()                                   # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)               # remove punctuation
    tokens = word_tokenize(text)                          # tokenize
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]   # remove stopwords
    lem = WordNetLemmatizer()
    return [lem.lemmatize(t) for t in tokens]             # lemmatize: running → run

# ── MODERN NEURAL APPROACH (BERT/LLM tokenizers) ──────────────────────────
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def neural_preprocess(text: str) -> dict:
    # No manual preprocessing needed — the tokenizer handles everything.
    # Uses WordPiece/BPE subword tokenization:
    # "running" → ["run", "##ning"] — keeps morphological structure
    return tokenizer(text, truncation=True, max_length=512,
                     return_tensors="pt")

# Neural models learn from raw text — stopwords and casing carry information
```

Why neural NLP skips preprocessing
Removing stopwords discards context ("not", "no", "never" are stopwords but completely change meaning). Lowercasing loses NER signals ("Apple" vs "apple"). Stemming conflates different senses ("better" → "good" loses comparison). Subword tokenizers handle all cases naturally without human rules.
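The subword behaviour described above can be sketched with BERT's greedy longest-match-first WordPiece algorithm over a hypothetical toy vocabulary (real WordPiece vocabularies have roughly 30k entries and are learned from data, not hand-written):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, as in BERT's WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation-piece marker
            if sub in vocab:
                piece = sub               # longest matching piece found
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]              # no vocabulary piece matched
        tokens.append(piece)
        start = end
    return tokens

# toy vocabulary for illustration only
vocab = {"run", "##ning", "jump", "##s", "the"}
```

With this vocabulary, `wordpiece_tokenize("running", vocab)` yields `["run", "##ning"]`: the morphological split falls out of the vocabulary rather than from hand-written stemming rules.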
The evolution: from bag-of-words to LLMs
| Era | Paradigm | Key methods | Limitation overcome / introduced |
|---|---|---|---|
| 1960s–80s | Rule-based | Hand-written grammars, lexicons, ELIZA | No learning — brittle, can't generalize |
| 1990s–2000s | Statistical NLP | TF-IDF, n-grams, HMMs, Naive Bayes | Learned from data — but bag-of-words, no word order |
| 2013 | Word embeddings | Word2Vec, GloVe | Semantic similarity captured — but context-free (one vector per word) |
| 2014–17 | Neural NLP | LSTMs, CNNs for text | Learned features — but sequential, slow to train |
| 2018 | Transfer learning | BERT (MLM), GPT (LM) | Pretrain once, fine-tune everywhere — transformed NLP benchmarks |
| 2020–present | Foundation models | GPT-3, ChatGPT, LLaMA, Claude | In-context learning, instruction following, emergent capabilities at scale |
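The "context-free" limitation of the 2013 word-embedding era can be seen directly with cosine similarity over static vectors. The 3-d vectors below are made-up illustrative values, not trained embeddings; the point is structural: "bank" gets exactly one vector, so its financial and river senses are collapsed into it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up 3-d vectors for illustration (not trained embeddings)
emb = {
    "bank":  [0.7, 0.1, 0.6],   # ONE vector for both senses of "bank"
    "money": [0.9, 0.0, 0.1],
    "river": [0.1, 0.1, 0.9],
}
```

Here `cosine(emb["bank"], emb["money"])` and `cosine(emb["bank"], emb["river"])` are both high: a single static vector cannot pick a sense per context, which is exactly what contextual models (ELMo, BERT) later fixed.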
Linguistic complexity: what AI must handle
Natural language is deceptively complex. Here are the core challenges, with examples of where even frontier LLMs still fail:
| Challenge | Example | Why it's hard |
|---|---|---|
| Lexical ambiguity | "The bank was steep" (river? financial?) | Same word, multiple meanings — requires context |
| Structural ambiguity | "I saw the man with the telescope" | Two valid parse trees — only world knowledge resolves |
| Coreference | "The trophy didn't fit because it was too big" | What is "it"? Requires common sense (Winograd schema) |
| Implicit knowledge | "I need a plumber. The sink is overflowing." | Causal link requires world knowledge not in text |
| Pragmatics | "Can you pass the salt?" | Request, not ability question — speech act theory |
| Negation scope | "The patient has no fever or chills" | Negation spans both — critical in clinical NLP |
| Sarcasm/irony | "Oh great, another Monday" | Literal meaning opposite to intended — tone-dependent |
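The negation-scope row is exactly what rule-based clinical systems such as NegEx approximate. Below is a drastically simplified sketch of that idea; the trigger list and window size are illustrative choices, not the published NegEx rules:

```python
NEG_TRIGGERS = {"no", "not", "never", "without", "denies"}

def negated_terms(sentence, window=5):
    """Flag tokens within `window` words after a negation trigger
    (heavily simplified NegEx-style rule; illustrative only)."""
    tokens = sentence.lower().split()
    negated = set()
    for i, tok in enumerate(tokens):
        if tok in NEG_TRIGGERS:
            # everything in the window after the trigger is in scope
            negated.update(tokens[i + 1 : i + 1 + window])
    return negated
```

On "The patient has no fever or chills", the single trigger "no" correctly scopes over both symptoms, but a fixed window fails on longer coordinated phrases, which is why neural models replaced these rules.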
LLM failure modes on linguistics
LLMs still fail systematically on: negation in complex sentences, Winograd-schema coreference that requires specific world knowledge, long-range syntactic agreement, and multi-hop reasoning that requires combining implicit facts from separate parts of a document.
Evaluation metrics for NLP
| Metric | Measures | Used for | Limitation |
|---|---|---|---|
| BLEU | n-gram precision vs reference | Machine translation | Doesn't capture meaning — low correlation with human judgment |
| ROUGE-L | Longest common subsequence recall | Summarization | Rewards extractive, penalizes creative paraphrase |
| METEOR | Precision + recall + alignment (synonyms) | Translation, summarization | Slower; synonym matching is language-dependent |
| BERTScore | Semantic similarity via BERT embeddings | Generation quality | Better than BLEU but can miss factual errors |
| Perplexity | Model surprise on test text (↓ = better) | Language model quality | Low perplexity doesn't guarantee downstream task quality |
| MMLU / HellaSwag | Multiple-choice knowledge/reasoning | LLM benchmarking | Saturated — frontier models score > 90% |
| Human evaluation | Fluency, factuality, helpfulness (gold standard) | Final model assessment | Expensive, slow, inter-annotator disagreement |
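BLEU's clipped n-gram precision idea can be sketched in a few lines. This is a simplified single-reference variant up to bigrams; real BLEU uses up to 4-grams, multiple references, and corpus-level statistics (in practice, use a standard implementation such as sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU with uniform n-gram weights."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # clipped counts: a candidate n-gram only scores up to its reference count
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: punish candidates shorter than the reference
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * geo_mean
```

The table's limitation is visible here: the score only counts surface n-gram overlap, so a perfect paraphrase with different words scores zero.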
Modern evaluation practice
BLEU/ROUGE are declining in use as LLM-as-judge evaluations (GPT-4, Claude scoring outputs on rubrics) correlate better with human preferences. MT-Bench, AlpacaEval, and LMSYS Chatbot Arena are the dominant LLM evaluation frameworks in 2025.
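Perplexity from the metrics table, by contrast, is mechanically simple: it is the exponentiated average negative log-probability the model assigns to held-out tokens. A minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# a model assigning probability 0.25 to every token has perplexity ≈ 4,
# i.e. it is "as confused" as a uniform choice among 4 tokens
ppl = perplexity([math.log(0.25)] * 10)
```

The intuition: perplexity ≈ k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step.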
Practice questions
- What is the pipeline difference between NLP in the 2010s (pre-deep learning) and NLP in 2024? (Answer: 2010s NLP: pipeline of specialised components — tokeniser → POS tagger → NER → parser → coreference resolver → task-specific classifier. Each component was separately trained and errors accumulated through the pipeline. 2024 NLP: a single large language model handles tokenisation, understanding, generation, and all downstream tasks through prompting or fine-tuning. One model replaces the entire pipeline, handles task combinations naturally, and achieves better performance on most tasks.)
- What is the Chomsky hierarchy and how do neural LLMs relate to formal language theory? (Answer: Chomsky hierarchy: Regular < Context-Free < Context-Sensitive < Recursively Enumerable languages. Regular: finite automata (regex). Context-Free: pushdown automata (most programming languages). Natural language: roughly context-sensitive (agreement, cross-serial dependencies). Neural LLMs: empirically capable of generating text that follows complex natural language patterns, including context-sensitive phenomena. However, LLMs are not proven to recognise formal languages beyond their training distribution — they approximate statistical patterns rather than parsing formal grammars.)
- What is the difference between extractive and abstractive NLP tasks? (Answer: Extractive: the output is a subset of the input — NER (extracting entity spans), extractive QA (finding answer span), extractive summarisation (selecting key sentences). Simpler, more reliable, no hallucination risk. Abstractive: the output requires generating new text not present in the input — abstractive summarisation (paraphrasing), translation, abstractive QA (reasoning across documents). More flexible, more natural-sounding but risk generating incorrect content not supported by the source.)
- What is the Turing Test and why is passing it no longer considered sufficient evidence of general language understanding? (Answer: Turing Test (Turing 1950): if a human interrogator cannot distinguish a machine from a human in text conversation, the machine exhibits intelligence. Modern LLMs (GPT-4, Claude) consistently fool human judges in short conversations. However, they still: hallucinate facts, fail basic physical reasoning, lack consistent world models, and cannot reliably solve novel logical puzzles. Passing the Turing Test by producing human-like text is a weaker criterion than general language understanding — it tests conversational fluency, not reasoning.)
- What is the 'bitter lesson' (Rich Sutton) and how has it shaped modern NLP? (Answer: The bitter lesson (Sutton 2019): AI progress consistently comes from leveraging computation rather than incorporating human domain knowledge. Hand-crafted features, linguistic rules, and expert-designed NLP pipelines were repeatedly outperformed by general-purpose methods (neural networks) given enough data and compute. The lesson: invest in methods that scale with compute. Modern NLP took this lesson: transformers + massive data + scale replaced hand-crafted NLP pipelines. The 'bitter' part: human expertise in linguistics ultimately mattered less than raw computation.)