Evaluating NLP systems requires specialised metrics because text output is not a single number. BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between machine and human translations. ROUGE measures recall of n-gram overlap for summarisation evaluation. Perplexity measures how well a language model predicts held-out text — lower is better. BERTScore uses contextual embeddings for semantic similarity rather than exact word overlap. These metrics are imperfect but are the standard benchmarks used in academic and industry NLP evaluation.
BLEU — evaluating translation quality
BLEU = BP · exp(Σₙ wₙ log pₙ). pₙ = modified n-gram precision (how many n-grams in the hypothesis match the references, with counts clipped at the maximum reference count). BP = brevity penalty (punishes short translations). wₙ = weights, uniform by default (wₙ = 1/N). Standard: N=4, uniform weights. BLEU ∈ [0,1], higher is better. BLEU=1.0 = exact match with a reference. Human translator BLEU ≈ 0.6.
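The formula above can be combined in a few lines of plain Python. This is a sketch with made-up precision values; the function name and inputs are illustrative, not from NLTK:

```python
import math

def bleu_from_precisions(precisions, hyp_len, ref_len):
    """BLEU = BP * exp(sum_n w_n * log p_n), uniform weights w_n = 1/N,
    brevity penalty BP = 1 if c >= r else exp(1 - r/c)."""
    n = len(precisions)
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; real toolkits apply smoothing instead
    log_avg = sum(math.log(p) for p in precisions) / n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_avg)

# Hypothetical precisions for an 18-token hypothesis vs a 20-token reference
print(round(bleu_from_precisions([0.8, 0.6, 0.4, 0.3], 18, 20), 4))  # ≈ 0.4384
```

Note that any single zero pₙ collapses the geometric mean to zero, which is why sentence-level BLEU is usually computed with smoothing.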
BLEU, ROUGE, BERTScore computation
```python
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# ── BLEU for machine translation ──
# References: one or more human translations per sentence (list of lists of token lists)
# Hypotheses: machine translations (list of token lists)
references = [
    [['the', 'cat', 'is', 'on', 'the', 'mat']],          # 1 reference for sent 1
    [['the', 'dog', 'barked', 'at', 'the', 'mailman']],  # 1 reference for sent 2
]
hypotheses = [
    ['the', 'cat', 'is', 'on', 'mat'],           # Missing one 'the'
    ['the', 'dog', 'barked', 'at', 'a', 'man'],  # 'the mailman' → 'a man'
]

# Corpus-level BLEU (standard for MT evaluation)
bleu_score = corpus_bleu(references, hypotheses)
print(f"Corpus BLEU: {bleu_score:.4f}")

# Sentence-level BLEU with smoothing (avoids zero scores on short sentences)
smooth = SmoothingFunction().method1
for ref, hyp in zip(references, hypotheses):
    score = sentence_bleu(ref, hyp, smoothing_function=smooth)
    print(f"Sentence BLEU: {score:.4f} | Hyp: {' '.join(hyp)}")

# Individual n-gram precision breakdown
from nltk.translate.bleu_score import modified_precision

ref = [['the', 'cat', 'is', 'on', 'the', 'mat']]
hyp = ['the', 'cat', 'is', 'on', 'mat']
for n in range(1, 5):
    prec = modified_precision(ref, hyp, n)
    print(f"  P_{n}: {float(prec):.3f}")
# P_1: 1.000 (all 5 unigrams appear in the reference)
# P_2: 0.750 (3 of 4 bigrams match)
# P_3: 0.667 (2 of 3 trigrams match)
# P_4: 0.500 (1 of 2 4-grams match)

# ── ROUGE for summarisation ──
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference_summary = """The transformer architecture revolutionised NLP by replacing
recurrent networks with self-attention mechanisms for parallel processing."""
hypothesis1 = """Transformers changed NLP by using self-attention instead of
recurrent networks, enabling faster parallel training."""
hypothesis2 = """Scientists invented a new cooking technique in 2019."""

for hyp_name, hyp in [("Good", hypothesis1), ("Bad", hypothesis2)]:
    scores = scorer.score(reference_summary, hyp)
    print(f"\n{hyp_name} summary ROUGE:")
    for metric, score in scores.items():
        print(f"  {metric}: P={score.precision:.3f}, R={score.recall:.3f}, F1={score.fmeasure:.3f}")
# ROUGE-1: measures unigram overlap
# ROUGE-2: measures bigram overlap
# ROUGE-L: measures longest common subsequence

# ── BERTScore: semantic similarity beyond exact match ──
try:
    from bert_score import score as bert_score
    # BERTScore uses contextual BERT embeddings to compare semantic similarity
    # Better than BLEU/ROUGE for paraphrases (same meaning, different words)
    cands = ["The cat sat on the mat"]
    refs = ["A feline rested on the rug"]  # Paraphrase: different words, same meaning
    P, R, F1 = bert_score(cands, refs, lang='en', verbose=False)
    print(f"\nBERTScore F1 for paraphrase: {F1.mean():.3f}")  # ~0.88 (high!)
    # BLEU would give 0 (no word overlap). BERTScore captures semantic equivalence.
except ImportError:
    print("BERTScore not installed: pip install bert-score")
```

ROUGE — summarisation evaluation
| Metric | What it measures | Formula | A high score means |
|---|---|---|---|
| ROUGE-1 | Unigram recall — individual word coverage | matched unigrams / reference unigrams | Summary covers key vocabulary of reference |
| ROUGE-2 | Bigram recall — phrase coverage | matched bigrams / reference bigrams | Summary preserves key phrases and sequences |
| ROUGE-L | Longest Common Subsequence — fluency | LCS length / reference length | Summary is fluent and structurally similar |
| ROUGE-S | Skip-bigram co-occurrence | Skip-bigrams matched / all skip-bigrams | Summary preserves word relationships |
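The LCS at the heart of ROUGE-L can be computed directly with classic dynamic programming. A minimal pure-Python sketch of the recall form (helper names are my own, not from the rouge_score library; no stemming or tokenisation beyond whitespace):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if tok_a == tok_b else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_recall(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    return lcs_length(ref, hyp) / len(ref)

# One substituted word ('sat' → 'lay') keeps 5 of 6 reference tokens in order
print(round(rouge_l_recall("the cat sat on the mat", "the cat lay on the mat"), 3))  # 0.833
```

Unlike ROUGE-2, the LCS tolerates gaps, so word substitutions and insertions reduce the score gracefully rather than breaking every bigram they touch.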
Limitations of BLEU and ROUGE
BLEU and ROUGE measure surface overlap — not semantic quality. A paraphrase ("quick brown fox" → "fast auburn fox") scores BLEU=0 despite being a perfect translation. A repetitive summary ("The cat sat on the mat. The cat sat on the mat.") can score high ROUGE despite being useless. Modern evaluation increasingly uses human ratings, BERTScore, or LLM-as-judge approaches (GPT-4 evaluating model outputs).
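Both failure modes are easy to reproduce with a bare-bones unigram recall, a simplified stand-in for ROUGE-1 without stemming (names are illustrative):

```python
from collections import Counter

def unigram_recall(reference: str, hypothesis: str) -> float:
    """Fraction of reference tokens covered by the hypothesis (clipped counts)."""
    ref, hyp = Counter(reference.lower().split()), Counter(hypothesis.lower().split())
    overlap = sum(min(count, hyp[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())

ref = "the cat sat on the mat"
print(unigram_recall(ref, "the cat sat on the mat the cat sat on the mat"))  # 1.0: repetition not punished
print(unigram_recall(ref, "a feline rested on the rug"))  # ~0.33: paraphrase punished
```

The useless repetitive summary gets perfect recall, while a faithful paraphrase scores poorly, which is exactly the gap BERTScore and human evaluation are meant to close.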
Perplexity — evaluating language models
Perplexity = exponentiated average negative log-likelihood: PPL = exp(−(1/N) Σᵢ log p(xᵢ | x₍<ᵢ₎)). Equivalent to the geometric mean of the inverse token probabilities. Lower is better. PPL=1: perfect (the model always assigns probability 1 to the correct next token). PPL=50: on average, the model is as unsure as a uniform choice among 50 candidate tokens. Strong modern language models reach single-digit perplexities on clean English text; note that perplexities are only comparable between models that share a tokeniser.
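The definition reduces to a few lines. A sketch over a hypothetical list of per-token probabilities, not tied to any particular model:

```python
import math

def perplexity_from_probs(token_probs: list[float]) -> float:
    """exp of the average negative log-likelihood over the sequence."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity_from_probs([1.0, 1.0, 1.0]))            # 1.0: perfect prediction
print(round(perplexity_from_probs([0.02] * 5), 3))       # 50.0: a uniform 50-way guess per token
```

The second case makes the "equally unsure between 50 candidates" reading concrete: assigning every token probability 1/50 yields a perplexity of exactly 50.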
Computing perplexity on different text types
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch, math

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

def perplexity(text: str, stride: int = 512) -> float:
    """Compute perplexity with a sliding window for texts longer than the context."""
    encodings = tokenizer(text, return_tensors='pt')
    max_length = model.config.n_positions  # 1024 for GPT-2
    seq_len = encodings.input_ids.size(1)
    nlls, n_tokens, prev_end = [], 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end  # only score tokens not counted in a previous window
        input_ids = encodings.input_ids[:, begin:end]
        target_ids = input_ids.clone()
        target_ids[:, :-target_len] = -100  # ignore overlapping context tokens in the loss
        with torch.no_grad():
            nll = model(input_ids, labels=target_ids).loss  # mean NLL over scored tokens
        nlls.append(nll * target_len)
        n_tokens += target_len
        prev_end = end
        if end == seq_len:
            break
    return math.exp(torch.stack(nlls).sum() / n_tokens)

texts = {
    'Simple English': "The cat sat on the mat. The dog ran in the park.",
    'Scientific': "Transformer architectures utilise multi-head self-attention mechanisms.",
    'Random': "Purple elephant dances quantum philosophy seventeen banana.",
    'Code': "def fibonacci(n): return n if n < 2 else fibonacci(n-1) + fibonacci(n-2)",
}
for label, text in texts.items():
    ppl = perplexity(text)
    print(f"PPL={ppl:6.1f}: {label}")
# Simple English: low PPL (well-represented in training data)
# Scientific: medium PPL (technical vocabulary)
# Random: very high PPL (semantically incoherent)
# Code: medium-high PPL (GPT-2 saw relatively little code)
```

Practice questions
- Machine translation BLEU = 0.65. Is this good? (Answer: Yes — human-level translation achieves BLEU ≈ 0.60-0.70 on standard benchmarks. BLEU > 0.6 is considered near-human quality. Commercial MT systems (Google Translate) typically achieve BLEU 0.50-0.65 on standard WMT benchmarks.)
- Why does BLEU use a brevity penalty? (Answer: Without it, a model could achieve perfect precision by outputting just one highly probable word (e.g., "the") that appears in all references. P_1 = 1.0 but the output is useless. The brevity penalty multiplies the score by e^(1-r/c) whenever the hypothesis length c is shorter than the reference length r, penalising short translations.)
- ROUGE-1 = 0.9, ROUGE-2 = 0.3 for a summary. What does this suggest? (Answer: High ROUGE-1 means the summary covers key individual words from the reference. Low ROUGE-2 means few consecutive word pairs match — the summary may have rearranged key terms without preserving phrase structure. The summary covers the right vocabulary but possibly in a different order or with different context.)
- A language model has PPL=10 on the training set and PPL=200 on the test set. What is happening? (Answer: Severe overfitting — the model has memorised the training distribution and generates that text easily (PPL=10) but fails to generalise to new text (PPL=200). A good model should have similar PPL on train and test. Train PPL should be slightly lower than test PPL.)
- BERTScore is better than BLEU for paraphrase evaluation. Why? (Answer: BLEU counts exact word overlap — "quick fox" and "fast fox" share only one word, giving low BLEU. BERTScore encodes both texts with BERT and computes cosine similarity of contextual embeddings — semantically similar words (quick≈fast, fox≈fox) score high even without exact match. BERTScore is more aligned with human judgments of translation quality.)
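The brevity penalty from the second question can be sketched numerically (function name is my own):

```python
import math

def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """BP = 1 when the hypothesis is at least reference length,
    exp(1 - r/c) when it is shorter."""
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

# A degenerate one-word hypothesis against a 20-word reference is crushed by BP,
# no matter how perfect its unigram precision is:
for c in (1, 5, 10, 20):
    print(f"hyp_len={c:2d}: BP={brevity_penalty(c, 20):.4f}")
```

The penalty falls off exponentially as the hypothesis shrinks, so precision gaming with ultra-short outputs cannot pay off.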
On LumiChats
LumiChats performance is evaluated using ROUGE for summarisation quality, BERTScore for response relevance, and human preference ratings. Understanding these metrics helps you interpret when LumiChats explains its confidence or why a summarisation might miss certain details.