LLM benchmarks are standardised test suites measuring specific capabilities: MMLU (multitask knowledge), HumanEval (code generation), GSM8K (math reasoning), HellaSwag (commonsense), MATH (competition mathematics), and MT-Bench (instruction following). Benchmark scores are essential for comparing models but have well-known limitations — benchmark saturation, data contamination (training on test data), and poor correlation with real-world deployment performance. The industry increasingly combines automated benchmarks with human evaluation, A/B testing in production, and task-specific evaluation suites.
Major LLM benchmarks and what they measure
| Benchmark | Measures | Format | Human baseline | GPT-4 score |
|---|---|---|---|---|
| MMLU | Knowledge across 57 subjects (law, medicine, CS, history) | Multiple choice, 4 options | ~88% | ~87% |
| HumanEval | Python code generation correctness | Complete function from docstring | N/A | ~67% pass@1 |
| GSM8K | Grade school math word problems | Free-form reasoning + answer | ~98% | ~92% |
| MATH | Competition mathematics (AMC, AIME level) | Multi-step problem solving | ~40% | ~42% |
| HellaSwag | Physical commonsense (activity completion) | Multiple choice sentence completion | ~95% | ~95% |
| MT-Bench | Multi-turn instruction following quality | GPT-4 judges 1-10 score | N/A | 8.99/10 |
| BIG-Bench Hard | Hard reasoning tasks requiring multi-step | Multiple choice | N/A | Varies widely |
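The HumanEval pass@1 figure in the table is typically computed with the unbiased pass@k estimator from the HumanEval paper: draw n ≥ k samples per problem, count the c that pass the unit tests, and estimate 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    draws (without replacement) from n samples, c correct, passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per problem, 80 of which pass the tests:
print(pass_at_k(200, 80, 1))   # pass@1 reduces to c/n = 0.4
print(pass_at_k(200, 80, 10))  # pass@10 is far higher
```

Note why the combinatorial form matters: naively averaging 1 − (1 − c/n)^k over problems is biased when sampling without replacement from a finite n.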
Running LLM benchmarks with lm-evaluation-harness
```shell
# The EleutherAI Language Model Evaluation Harness is the standard tool
pip install lm-eval

# Command-line evaluation (the most common pattern):
# evaluate Llama-3.2-1B on MMLU and GSM8K
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct \
    --tasks mmlu,gsm8k \
    --device cuda:0 \
    --batch_size 8 \
    --output_path ./results/llama_1b
```
```python
# Python API evaluation
import json

from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

# Load model for evaluation
lm_obj = HFLM(
    pretrained="unsloth/Llama-3.2-1B-Instruct",
    dtype="bfloat16",
    device="cuda",
)

# Run evaluation on multiple benchmarks. num_fewshot is a single integer
# applied to every task; omit it to use each task's default few-shot count.
results = simple_evaluate(
    model=lm_obj,
    tasks=["mmlu", "gsm8k", "hellaswag"],
    num_fewshot=5,
    batch_size=8,
)

# Print results (lm-eval v0.4+ keys metrics as "metric,filter";
# gsm8k reports exact_match rather than accuracy)
for task, metrics in results["results"].items():
    acc = metrics.get("acc,none",
          metrics.get("acc_norm,none",
          metrics.get("exact_match,strict-match", "N/A")))
    if isinstance(acc, float):
        print(f"{task:20}: {acc:.3f}")
    else:
        print(f"{task:20}: {acc}")
```
```python
# ── Custom benchmark for your specific use case ──
# Standard benchmarks rarely match production requirements,
# so build task-specific evaluation suites.

def evaluate_sql_generation(model, tokenizer, test_cases):
    """Evaluate a model on SQL generation for your schema.

    Assumes execute_sql() runs a query against a test database
    and returns its rows.
    """
    correct = 0
    for prompt, expected_sql, db_schema in test_cases:
        full_prompt = f"Schema: {db_schema}\nQuestion: {prompt}\nSQL:"
        inputs = tokenizer(full_prompt, return_tensors="pt").to("cuda")
        # Greedy decoding: use do_sample=False (transformers ignores temperature
        # when sampling is off; temperature=0 with sampling on is invalid)
        outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
        generated_sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Execute both queries and compare result sets (order-insensitive)
        try:
            result_generated = execute_sql(generated_sql)
            result_expected = execute_sql(expected_sql)
            if set(map(tuple, result_generated)) == set(map(tuple, result_expected)):
                correct += 1
        except Exception:
            pass  # a SQL error counts as wrong
    return correct / len(test_cases)
```
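The `execute_sql` helper above is left undefined; for local testing it can be backed by an in-memory SQLite database. A sketch, where the `users` table and its rows are made-up illustration data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT, active INTEGER);
    INSERT INTO users VALUES (1, 'ada', 1), (2, 'bob', 0), (3, 'cyd', 1);
""")

def execute_sql(sql: str):
    """Run a query against the test database and return its rows."""
    return conn.execute(sql).fetchall()

# Two differently written but equivalent queries compare equal as sets,
# which is exactly the order-insensitive check the evaluator uses:
a = execute_sql("SELECT name FROM users WHERE active = 1")
b = execute_sql("SELECT name FROM users WHERE active = 1 ORDER BY name DESC")
print(set(map(tuple, a)) == set(map(tuple, b)))  # True
```

Comparing result sets rather than SQL strings is the key design choice: it credits any query that produces the right data, however it is phrased.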
```python
# ── LLM-as-judge evaluation (MT-Bench style) ──
import json

from openai import OpenAI

client = OpenAI()

def llm_judge_response(question: str, response: str, reference: str = None) -> dict:
    """Use GPT-4o-mini as an evaluator (much cheaper than GPT-4)."""
    rubric = """Rate this response 1-10 on:
- Accuracy (is it factually correct?)
- Completeness (does it fully answer the question?)
- Clarity (is it easy to understand?)
Provide a JSON: {"scores": {"accuracy": X, "completeness": X, "clarity": X}, "reasoning": "..."}"""
    eval_prompt = f"Question: {question}\nResponse: {response}\n{rubric}"
    if reference:
        eval_prompt += f"\nReference answer: {reference}"
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)
```
Benchmark limitations and real-world evaluation
- Benchmark contamination: Models trained on internet text may have seen benchmark test sets. MMLU questions appear in many online study guides. Contaminated models score artificially high — not because they generalise, but because they memorised test questions during pretraining.
- Benchmark saturation: GPT-4 scores ~88% on MMLU (same as human average). This does not mean GPT-4 has human-level knowledge — it means MMLU is too easy to differentiate frontier models. New harder benchmarks (GPQA, ARC-AGI) are constantly needed.
- Distribution mismatch: MMLU measures multiple-choice test performance. Production LLMs primarily answer open-ended questions, write code, and hold conversations. High MMLU score does not guarantee good conversational ability.
- Goodhart's Law in benchmarks: Once a benchmark is widely used, developers optimise specifically for it. Models can be fine-tuned to ace MMLU without improving general knowledge.
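One rough but common screen for the contamination described above is n-gram overlap between training documents and benchmark test items (OpenAI's GPT-3 analysis used 13-gram matching; the sketch below uses 5-grams on toy strings for brevity):

```python
def ngrams(text: str, n: int = 5) -> set:
    """All word-level n-grams of a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, test_item: str, n: int = 5) -> bool:
    """Flag a test item if any of its n-grams also appears in the training text."""
    return bool(ngrams(test_item, n) & ngrams(train_doc, n))

doc = "the mitochondria is the powerhouse of the cell and produces atp"
leaked = "which organelle is the powerhouse of the cell"
fresh = "what is the capital city of france"
print(is_contaminated(doc, leaked))  # True  (shared 5-gram)
print(is_contaminated(doc, fresh))   # False
```

At web scale this runs against a Bloom filter or suffix-array index of the corpus rather than Python sets, but the decision rule is the same.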
Gold standard: Chatbot Arena (LMSYS)
Chatbot Arena (chat.lmsys.org) is the most trustworthy LLM ranking: users submit prompts, two anonymous models respond, and the user picks the winner. Votes aggregate into Elo-style ratings. Unlike static benchmarks, Arena reflects diverse real-world usage, resists contamination (fresh prompts arrive every day), and is extremely hard to game. Claude, GPT-4o, and Gemini models compete at the top of this leaderboard.
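The rating mechanics behind Arena-style leaderboards are easy to sketch: after each head-to-head vote, the winner takes points from the loser in proportion to how surprising the result was. (Arena's published rankings actually fit a Bradley–Terry model over all battles, but the intuition matches this classic online Elo update.)

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """One Elo update after a head-to-head vote between two models."""
    # Expected win probability for the winner given the rating gap
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1.0 - expected_win)  # surprising wins move ratings more
    return r_winner + delta, r_loser - delta

print(elo_update(1200.0, 1200.0))  # equal ratings: each side moves by k/2 = 16
print(elo_update(1100.0, 1300.0))  # upset win: a much larger swing
```

Because every vote is a fresh pairwise preference, there is no fixed answer key to memorise, which is why this scheme resists the contamination that plagues static benchmarks.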
Practice questions
- Model A scores 87% on MMLU, Model B scores 82%. Does this mean Model A is better for production use? (Answer: Not necessarily. MMLU measures academic knowledge in multiple-choice format. Production performance depends on the specific task: code generation, conversation quality, instruction following, safety. Always evaluate on your specific use case. Model B might score higher on HumanEval (code) or have lower latency for your response time requirements.)
- What is benchmark contamination and why does it matter? (Answer: LLMs pretrain on internet text which includes benchmark test sets. A model that has memorised MMLU questions scores high without truly understanding the material — analogous to cheating on an exam. Contamination makes it hard to fairly compare models and overestimates capabilities. Detection: check if accuracy on held-out variants drops significantly.)
- Why is Chatbot Arena (Elo-based) considered more reliable than static benchmarks? (Answer: User prompts are diverse, fresh (daily new prompts), and match real-world use patterns. Anonymous comparison eliminates bias toward known models. Elo system averages thousands of real preferences. No fixed answer key = no contamination. Hard to game — you cannot train specifically on tomorrow's user prompts.)
- What does pass@k measure in HumanEval? (Answer: The probability that at least one of k generated code samples passes all unit tests. pass@1 = accuracy with a single attempt; pass@10 = probability of at least one correct solution in 10 attempts. Higher k → more chances to get it right. Production code assistants effectively operate at pass@10+ since users can ask for regeneration.)
- Your fine-tuned model scores 95% on your custom evaluation dataset but performs poorly in production. What might explain this? (Answer: Overfitting to the evaluation dataset (if it overlaps with fine-tuning data). Distribution shift between evaluation examples and real user queries. Evaluation prompts may be easier than real prompts (cherry-picked). Automated metrics miss important quality dimensions. Solution: use a held-out test set never seen during training, add human evaluation of production samples.)
On LumiChats
LumiChats is evaluated using a combination of automated benchmarks (MMLU, HumanEval, MT-Bench), human preference ratings (similar to Chatbot Arena), and production A/B testing. Understanding these evaluation frameworks helps you interpret capability claims about AI products critically.