
Natural Language Processing (NLP)

Teaching machines to understand and generate human language.


Definition

Natural Language Processing (NLP) is the field of AI concerned with enabling computers to understand, generate, and reason about human language. NLP encompasses tasks from basic text preprocessing to sophisticated language understanding and generation — spanning sentiment analysis, machine translation, question answering, summarization, and conversational AI.

Core NLP tasks

NLP covers a wide spectrum of tasks — all ultimately about enabling machines to work with human language:

| Task | Description | Example | Models |
| --- | --- | --- | --- |
| Text classification | Assign a category label to text | Spam/not-spam, sentiment (pos/neg) | BERT, DistilBERT, SetFit |
| Named Entity Recognition (NER) | Tag spans with an entity type | "Apple [ORG] was founded by Steve Jobs [PER]" | BERT + token classifier, spaCy |
| Machine translation | Convert text from one language to another | "Hello" → "Bonjour" | NLLB-200, Helsinki-NLP, DeepL |
| Summarization | Compress a document to its key points | Extractive or abstractive | BART, PEGASUS, GPT-4 |
| Question answering | Answer questions from context/knowledge | SQuAD (span extraction), open-domain QA | BERT, RAG, LLMs |
| Relation extraction | Find relationships between entities | "Elon Musk FOUNDED Tesla" | BERT + classification head |
| Coreference resolution | Link pronouns to their referents | "The CEO said she..." → she = the CEO | SpanBERT, neural coref |
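As a concrete sketch of the text-classification row, here is a minimal bag-of-words Naive Bayes classifier in plain Python (the four training sentences are toy data invented for illustration, not a real dataset):

```python
import math
from collections import Counter

# Toy training data: (text, label) pairs, invented for illustration.
train = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting rescheduled to monday", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

counts = {"spam": Counter(), "ham": Counter()}   # word counts per class
docs = Counter()                                 # document counts per class
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())

def predict(text: str) -> str:
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in counts:
        # log prior + sum of log likelihoods with add-one (Laplace) smoothing
        score = math.log(docs[label] / sum(docs.values()))
        total = sum(counts[label].values())
        for word in text.split():
            score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("claim your free money"))   # spam
print(predict("see you at the meeting"))  # ham
```

Real systems would use TF-IDF features or a fine-tuned BERT, but the scoring logic (log prior plus smoothed log likelihoods) is the same idea that n-gram-era spam filters relied on.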

The NLP preprocessing pipeline

Classical NLP required extensive preprocessing before modeling. Modern neural NLP mostly bypasses this — but understanding it helps interpret legacy systems:

Classical NLP preprocessing vs modern neural approach

```python
# ── CLASSICAL APPROACH (needed for TF-IDF, n-grams, SVM) ──────────────────
# Requires one-time downloads: nltk.download('punkt'),
# nltk.download('stopwords'), nltk.download('wordnet')
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def classical_preprocess(text: str) -> list[str]:
    text = text.lower()                                  # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)              # remove punctuation
    tokens = word_tokenize(text)                         # tokenize
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    lem = WordNetLemmatizer()
    # lemmatize: 'cars' → 'car' (verbs need pos='v' to map 'running' → 'run')
    return [lem.lemmatize(t) for t in tokens]

# ── MODERN NEURAL APPROACH (BERT/LLM tokenizers) ──────────────────────────
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def neural_preprocess(text: str) -> dict:
    # No manual preprocessing needed — the tokenizer handles everything.
    # WordPiece/BPE subword tokenization keeps morphological structure:
    # "running" → ["run", "##ning"]
    return tokenizer(text, truncation=True, max_length=512,
                     return_tensors="pt")

# Neural models learn from raw text — stopwords and casing carry information.
```

Why neural NLP skips preprocessing

Removing stopwords discards meaning: "not", "no", and "never" appear on standard stopword lists yet flip a sentence's polarity. Lowercasing destroys NER signal ("Apple" vs "apple"). Lemmatization conflates distinct senses ("better" → "good" loses the comparative), and crude stemming produces non-words ("studies" → "studi"). Subword tokenizers handle all of these cases from raw text, with no hand-written rules.
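The stopword problem is easy to demonstrate in a few lines. A minimal sketch, using a small hand-picked stopword set (an assumption for illustration; real lists such as NLTK's also contain "not" and "no"):

```python
# Small hand-picked stopword subset, invented for illustration.
STOPWORDS = {"the", "a", "is", "was", "not", "no", "this"}

def strip_stopwords(text: str) -> str:
    # Drop every token that appears in the stopword set.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

print(strip_stopwords("this movie is not good"))  # "movie good"
print(strip_stopwords("this movie is good"))      # "movie good"
# Opposite sentiments collapse to identical features: the negation is gone.
```

A bag-of-words model downstream of this step literally cannot distinguish the two sentences, which is why neural pipelines feed the raw text to the tokenizer instead.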

The evolution: from bag-of-words to LLMs

| Era | Paradigm | Key methods | Limitation overcome / introduced |
| --- | --- | --- | --- |
| 1960s–80s | Rule-based | Hand-written grammars, lexicons, ELIZA | No learning — brittle, can't generalize |
| 1990s–2000s | Statistical NLP | TF-IDF, n-grams, HMMs, Naive Bayes | Learned from data — but bag-of-words, no word order |
| 2013 | Word embeddings | Word2Vec, GloVe | Semantic similarity captured — but context-free (one vector per word) |
| 2014–17 | Neural NLP | LSTMs, CNNs for text | Learned features — but sequential, slow to train |
| 2018 | Transfer learning | BERT (MLM), GPT (LM) | Pretrain once, fine-tune everywhere — transformed NLP benchmarks |
| 2020–present | Foundation models | GPT-3, ChatGPT, LLaMA, Claude | In-context learning, instruction following, emergent capabilities at scale |
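The statistical-era methods in the table are simple enough to sketch directly. Here is TF-IDF in plain Python (the three documents are toy examples invented for illustration):

```python
import math

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]

def tf_idf(term: str, doc_idx: int) -> float:
    # Assumes the term occurs in at least one document.
    tf = tokenized[doc_idx].count(term) / len(tokenized[doc_idx])
    df = sum(1 for d in tokenized if term in d)   # document frequency
    idf = math.log(len(tokenized) / df)           # rarer terms weigh more
    return tf * idf

# "the" is frequent but appears in most documents, so its weight is low;
# "cat" appears in only one document, so it is more discriminative there.
print(tf_idf("the", 0), tf_idf("cat", 0))
```

Note the bag-of-words limitation from the table: this representation has no notion of word order, which is exactly what word embeddings and later transformers were introduced to address.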

Linguistic complexity: what AI must handle

Natural language is deceptively complex. Here are the core challenges, with examples of where even frontier LLMs still fail:

| Challenge | Example | Why it's hard |
| --- | --- | --- |
| Lexical ambiguity | "The bank was steep" (riverbank? financial?) | Same word, multiple meanings — requires context |
| Structural ambiguity | "I saw the man with the telescope" | Two valid parse trees — only world knowledge resolves |
| Coreference | "The trophy didn't fit in the suitcase because it was too big" | What is "it"? Requires common sense (Winograd schema) |
| Implicit knowledge | "I need a plumber. The sink is overflowing." | Causal link requires world knowledge not in the text |
| Pragmatics | "Can you pass the salt?" | A request, not a question about ability — speech act theory |
| Negation scope | "The patient has no fever or chills" | The negation spans both symptoms — critical in clinical NLP |
| Sarcasm/irony | "Oh great, another Monday" | Literal meaning opposite to intended — tone-dependent |

LLM failure modes on linguistics

LLMs still fail systematically on: negation in complex sentences, Winograd-schema coreference that requires specific world knowledge, long-range syntactic agreement, and multi-hop reasoning that requires combining implicit facts from separate parts of a document.

Evaluation metrics for NLP

| Metric | Measures | Used for | Limitation |
| --- | --- | --- | --- |
| BLEU | n-gram precision vs reference | Machine translation | Doesn't capture meaning — low correlation with human judgment |
| ROUGE-L | Longest common subsequence recall | Summarization | Rewards extractive, penalizes creative paraphrase |
| METEOR | Precision + recall + alignment (synonyms) | Translation, summarization | Slower; synonym matching is language-dependent |
| BERTScore | Semantic similarity via BERT embeddings | Generation quality | Better than BLEU but can miss factual errors |
| Perplexity | Model surprise on test text (↓ = better) | Language model quality | Low perplexity doesn't guarantee downstream task quality; comparable only across models sharing a tokenizer |
| MMLU / HellaSwag | Multiple-choice knowledge/reasoning | LLM benchmarking | Saturated — frontier models score > 90% |
| Human evaluation | Fluency, factuality, helpfulness (gold standard) | Final model assessment | Expensive, slow, inter-annotator disagreement |
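Perplexity, unlike the other metrics in the table, has a closed-form definition: the exponentiated average negative log-likelihood per token. A minimal sketch (the per-token probabilities below are invented numbers, not from a real model):

```python
import math

def perplexity(token_probs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum(log p_i)): exponentiated average
    # negative log-likelihood the model assigns to each token.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical per-token probabilities from a language model:
print(perplexity([0.5, 0.5, 0.5, 0.5]))   # 2.0: as "surprised" as a coin flip
print(perplexity([0.9, 0.8, 0.95, 0.9]))  # ≈ 1.13: model is confident
```

In practice the log-probabilities come from the model's softmax outputs over held-out text, and the table's caveat applies: a confident model is not necessarily a useful one.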

Modern evaluation practice

BLEU/ROUGE are declining in use as LLM-as-judge evaluations (GPT-4, Claude scoring outputs on rubrics) correlate better with human preferences. MT-Bench, AlpacaEval, and LMSYS Chatbot Arena are the dominant LLM evaluation frameworks in 2025.

Practice questions

  1. What is the pipeline difference between NLP in the 2010s (pre-deep learning) and NLP in 2024? (Answer: 2010s NLP: pipeline of specialised components — tokeniser → POS tagger → NER → parser → coreference resolver → task-specific classifier. Each component was separately trained and errors accumulated through the pipeline. 2024 NLP: a single large language model handles tokenisation, understanding, generation, and all downstream tasks through prompting or fine-tuning. One model replaces the entire pipeline, handles task combinations naturally, and achieves better performance on most tasks.)
  2. What is the Chomsky hierarchy and how do neural LLMs relate to formal language theory? (Answer: Chomsky hierarchy: Regular < Context-Free < Context-Sensitive < Recursively Enumerable languages. Regular: finite automata (regex). Context-Free: pushdown automata (most programming languages). Natural language: roughly context-sensitive (agreement, cross-serial dependencies). Neural LLMs: empirically capable of generating text that follows complex natural language patterns, including context-sensitive phenomena. However, LLMs are not proven to recognise formal languages beyond their training distribution — they approximate statistical patterns rather than parsing formal grammars.)
  3. What is the difference between extractive and abstractive NLP tasks? (Answer: Extractive: the output is a subset of the input — NER (extracting entity spans), extractive QA (finding answer span), extractive summarisation (selecting key sentences). Simpler, more reliable, no hallucination risk. Abstractive: the output requires generating new text not present in the input — abstractive summarisation (paraphrasing), translation, abstractive QA (reasoning across documents). More flexible, more natural-sounding but risk generating incorrect content not supported by the source.)
  4. What is the Turing Test and why is passing it no longer considered sufficient evidence of general language understanding? (Answer: Turing Test (Turing 1950): if a human interrogator cannot distinguish a machine from a human in text conversation, the machine exhibits intelligence. Modern LLMs (GPT-4, Claude) consistently fool human judges in short conversations. However, they still: hallucinate facts, fail basic physical reasoning, lack consistent world models, and cannot reliably solve novel logical puzzles. Passing the Turing Test by producing human-like text is a weaker criterion than general language understanding — it tests conversational fluency, not reasoning.)
  5. What is the 'bitter lesson' (Rich Sutton) and how has it shaped modern NLP? (Answer: The bitter lesson (Sutton 2019): AI progress consistently comes from leveraging computation rather than incorporating human domain knowledge. Hand-crafted features, linguistic rules, and expert-designed NLP pipelines were repeatedly outperformed by general-purpose methods (neural networks) given enough data and compute. The lesson: invest in methods that scale with compute. Modern NLP took this lesson: transformers + massive data + scale replaced hand-crafted NLP pipelines. The 'bitter' part: human expertise in linguistics ultimately mattered less than raw computation.)
