Human language is ambiguous, context-dependent, and far more complex than any formal language. The same sentence can have multiple valid parsings (syntactic ambiguity), the same word can mean different things (lexical ambiguity), and pragmatic meaning often diverges from literal meaning. NLP systems must handle these challenges alongside practical issues: coreference resolution, sarcasm detection, multilingual processing, low-resource languages, and the ever-present problem of bias in training data. Understanding these challenges explains both the impressive capabilities and the surprising failures of modern NLP systems.
The fundamental challenges
| Challenge | Example | Why it is hard |
|---|---|---|
| Lexical ambiguity | "I saw the bat" — animal or sports equipment? | Same word, multiple meanings; requires world knowledge + context |
| Syntactic ambiguity | "I saw the man with the telescope" — who has the telescope? | Two valid parse trees; requires pragmatic reasoning |
| Coreference | "Alice told Bob she was tired" — who is "she"? | Requires tracking entities across sentences + common sense |
| Pragmatics | "Can you pass the salt?" = request, not question | Literal meaning ≠ intended meaning; requires social context |
| Sarcasm | "Oh great, another Monday" = negative | Tone and context reverse literal meaning |
| World knowledge | "The trophy did not fit in the suitcase because it was too big" — what is too big? | Requires physical world knowledge to resolve "it" |
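The PP-attachment ambiguity in the table's second row can be made concrete by writing out both parses as nested constituents. This is a plain-Python sketch using invented `(label, children...)` tuples with the usual S/VP/NP/PP labels; it does not depend on any parsing library:

```python
# Two valid parses of "I saw the man with the telescope",
# written as nested (label, children...) tuples.

# Parse 1: the PP attaches to the verb phrase — I used the telescope to see.
parse_instrument = (
    "S", ("NP", "I"),
    ("VP", "saw", ("NP", "the", "man"),
           ("PP", "with", ("NP", "the", "telescope"))),
)

# Parse 2: the PP attaches to the noun phrase — the man holds the telescope.
parse_possession = (
    "S", ("NP", "I"),
    ("VP", "saw",
           ("NP", "the", "man",
                  ("PP", "with", ("NP", "the", "telescope")))),
)

def leaves(tree):
    """Flatten a parse tree back to its word sequence."""
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    return [word for child in children for word in leaves(child)]

# Both trees cover the identical word sequence; only the structure differs,
# which is exactly why surface text alone cannot decide between them.
print(leaves(parse_instrument))
print(leaves(parse_instrument) == leaves(parse_possession))  # True
```

Because the two structures yield the same surface string, nothing short of pragmatic or world knowledge can pick the intended reading.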
NLP challenges demonstrated with spaCy and Transformers
```python
import spacy

nlp = spacy.load("en_core_web_sm")

# ── Coreference resolution challenge ──
sentences = [
    "Alice told Bob she was tired. He was also exhausted.",
    "The trophy did not fit in the suitcase because it was too big.",  # Winograd schema
    "The council denied the demonstrators a permit because they feared violence.",  # "they" = council or demonstrators?
]
for sent in sentences:
    doc = nlp(sent)
    print(f"Text: {sent}")
    print(f"Named entities: {[(ent.text, ent.label_) for ent in doc.ents]}")
    # Basic spaCy cannot resolve coreference — requires specialised models
    print()

# ── Word sense disambiguation challenge ──
ambiguous = [
    "I went to the bank to deposit money.",
    "The bank of the river was muddy.",
    "The children played on the bank of sand.",
]
print("'bank' in different contexts:")
for sent in ambiguous:
    doc = nlp(sent)
    bank = [t for t in doc if t.text.lower() == "bank"]
    if bank:
        print(f"  Context: '{sent}'")
        print(f"  Dependency: {bank[0].dep_}, Head: {bank[0].head.text}")
        # spaCy gives the grammatical role but not semantic disambiguation

# ── Negation handling ──
from transformers import pipeline

sentiments = [
    "This is not bad at all",       # double negation = positive
    "I would not say it is good",   # = negative
    "It is not without merit",      # = has merit (double negation)
]
clf = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
for text in sentiments:
    result = clf(text)[0]
    print(f"{result['label']:8} ({result['score']:.2f}): {text}")
```
Bias, fairness and safety in NLP
- Training data bias: models trained on internet text inherit biases — gender stereotypes (doctor = male, nurse = female in word embeddings), racial bias, political bias. BERT, trained on Wikipedia + BookCorpus, exhibits occupational gender stereotypes.
- Hallucination: LLMs generate confident-sounding but factually incorrect statements, a fundamental limitation of next-token prediction without explicit knowledge retrieval.
- Prompt injection: malicious instructions in user input that override the system prompt. Robustness to adversarial inputs remains an active research challenge.
Detecting and measuring NLP system biases
```python
from transformers import pipeline

# Masked-language-model bias probe (in the spirit of the
# Word Embedding Association Test, WEAT)
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Occupational bias
templates = [
    "The doctor told the nurse that [MASK] needed to update the charts.",
    "The engineer finished [MASK] project on time.",
    "The nurse greeted [MASK] patient with a smile.",
]
print("Occupational gender bias in BERT:")
for template in templates:
    results = unmasker(template)[:3]
    pronoun_results = [(r["token_str"], round(r["score"], 3))
                       for r in results if r["token_str"].lower() in
                       ["he", "she", "his", "her", "they", "their"]]
    print(f"  '{template}'")
    print(f"  Pronouns: {pronoun_results}")
# "The engineer finished [MASK] project" → "his" (0.81), "the" (0.05), "her" (0.04)

# Sentiment bias test
test_sentences = [
    "The {group} doctor examined the patient carefully.",
    "The {group} lawyer argued the case effectively.",
    "The {group} person was walking down the street.",
]
groups = ["Black", "White", "Asian", "male", "female"]
print("\nSentiment consistency test:")
clf = pipeline("sentiment-analysis")
for template in test_sentences[:1]:
    for group in groups:
        text = template.format(group=group)
        result = clf(text)[0]
        print(f"  {result['label']:8} ({result['score']:.3f}): {text}")
```
Why NLP is still an open problem
Despite GPT-4 passing the bar exam and medical licensing exams, NLP systems still fail on: (1) simple negation ("A is not B" confused with "A is B"); (2) common sense ("Can a crocodile run a steeplechase?"); (3) consistent multi-step reasoning; (4) novel language constructs; (5) low-resource languages (roughly 96% of the world's languages receive minimal NLP research attention). These gaps motivate ongoing research in grounding, reasoning, and robustness.
Practice questions
- "Visiting relatives can be boring." What type of ambiguity is this and what are the two interpretations? (Answer: Syntactic ambiguity (also called structural ambiguity). Interpretation 1: The act of visiting relatives is boring (visiting = gerund, subject of "can be boring"). Interpretation 2: Relatives who are visiting can be boring (visiting = adjective modifying relatives).)
- Word2Vec embedding for "doctor" is closer to "he" than "she". What is this and why does it happen? (Answer: Gender bias in word embeddings. The training corpus (Google News) contained more "he is a doctor" than "she is a doctor" due to historical gender imbalances in medical professions and how they were reported. The model learned statistical co-occurrences, not social ideals. Debiasing methods (Bolukbasi et al.) try to remove gender direction from occupational embeddings.)
- An LLM confidently states a wrong historical date. Is this a safety, bias, or hallucination issue? (Answer: Hallucination — the model generates plausible-sounding but factually incorrect text. Next-token prediction maximises coherence, not factual accuracy. The model does not "know" it is wrong. Solutions: RAG (retrieve from trusted sources), fact verification modules, uncertainty quantification.)
- The Winograd Schema "The trophy did not fit in the suitcase because it was too big" — what makes this hard for NLP models? (Answer: Requires physical world knowledge — trophies are usually bigger than suitcases. The word "it" could grammatically refer to either "trophy" or "suitcase". Resolving it correctly requires common-sense reasoning about object sizes, not just linguistic patterns. Winograd Schemas are a benchmark specifically designed to test common-sense reasoning.)
- Why is NLP for low-resource languages (Yoruba, Swahili, Welsh) much harder than for English? (Answer: Less training data means models have fewer examples to learn from. Less labelled data for fine-tuning. Fewer pre-trained models available. Many languages have richer morphology (agglutinative like Finnish, Turkish) that English-centric tokenisers handle poorly. Fewer evaluation benchmarks. Cross-lingual transfer (mBERT, XLM-R) partially addresses this.)
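The debiasing idea mentioned in the second answer (Bolukbasi et al.'s projection removal) can be sketched in a few lines of NumPy. The 4-dimensional vectors below are invented toy values purely for illustration; real embeddings are high-dimensional and learned from data:

```python
import numpy as np

# Toy embeddings (hypothetical values, for illustration only).
he = np.array([0.8, 0.1, 0.3, 0.0])
she = np.array([-0.8, 0.1, 0.3, 0.0])
doctor = np.array([0.4, 0.5, 0.2, 0.1])  # leans towards "he"

# Gender direction from a definitional pair, normalised to unit length.
g = he - she
g = g / np.linalg.norm(g)

# Hard debiasing: subtract the component of "doctor" along g.
doctor_debiased = doctor - np.dot(doctor, g) * g

print(np.dot(doctor, g))           # non-zero: the embedding is gender-biased
print(np.dot(doctor_debiased, g))  # ~0: gender component removed
```

After the projection is removed, "doctor" is equidistant from the gender direction, though this only neutralises one linear direction; bias can survive in other subspaces, which is why debiasing remains an open problem.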
On LumiChats
Understanding NLP challenges explains why LumiChats sometimes makes mistakes: hallucinating facts (LLM limitation), mishandling complex negation, or struggling with ambiguous phrasing. When you notice such issues, rephrasing the question, asking LumiChats to use RAG (search documents), or asking it to reason step-by-step helps significantly.