
Language Modelling — N-gram, Neural & Transformer LMs

The core task that teaches machines to understand and generate text.


Definition

A language model assigns a probability to sequences of tokens — it models P(w₁, w₂, ..., wₙ). The training objective is either to predict the next token (causal LM, autoregressive) or to predict masked tokens (masked LM). N-gram models estimate these probabilities by simple counting over a corpus. Neural LMs use RNNs or LSTMs. Transformer LMs (GPT, Claude, Llama) use self-attention and produce the most fluent text generation to date. Language modelling is the foundation task that gives LLMs their broad world knowledge and text understanding. Perplexity measures how well a language model predicts held-out text.
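The joint probability factorises by the chain rule, P(w₁,...,wₙ) = ∏ᵢ P(wᵢ | w₁...wᵢ₋₁), so next-token prediction is exactly learning each factor. A minimal sketch with a hand-set conditional table (all probabilities below are invented purely for illustration):

```python
# Hypothetical conditional distributions P(w_i | w_1..w_{i-1});
# the numbers are made up for illustration only.
cond_p = {
    (): {"the": 1.0},
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.5, "ran": 0.5},
}

def sequence_probability(tokens):
    """P(w1..wn) = product of P(w_i | preceding words) (chain rule)."""
    p = 1.0
    for i, w in enumerate(tokens):
        p *= cond_p[tuple(tokens[:i])][w]
    return p

print(sequence_probability(["the", "cat", "sat"]))  # 1.0 * 0.6 * 0.5 = 0.3
```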

Real-life analogy: The auto-complete engine

When you type 'The capital of France is...' into your phone, autocomplete suggests 'Paris'. Your phone has a language model. A basic one uses the last 2-3 words to predict the next (n-gram). GPT-4 uses ALL preceding tokens — the entire prompt — to predict the next token. Both are doing the same thing: assigning probabilities to 'what comes next'. The difference is how much context they use and how well they model long-range dependencies.

N-gram language models

N-gram models make the Markov assumption: the probability of word wᵢ depends only on the preceding N−1 words. Probabilities are estimated by counting in a training corpus: P(wᵢ | wᵢ₋ₙ₊₁ ... wᵢ₋₁) = C(wᵢ₋ₙ₊₁ ... wᵢ) / C(wᵢ₋ₙ₊₁ ... wᵢ₋₁), where C(·) is a corpus count. Smoothing (Laplace, Kneser-Ney) handles N-grams with zero counts.
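The counting estimate fits in a few lines of pure Python (toy corpus, no smoothing; an unseen context would raise a ZeroDivisionError here, which is exactly the sparsity problem smoothing fixes):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the rat".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))   # C(prev, word)
context_counts = Counter(corpus[:-1])              # C(prev)

def p_mle(word, prev):
    """Maximum-likelihood estimate: P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

print(p_mle("cat", "the"))   # 2 of the 4 "the" contexts are followed by "cat" -> 0.5
```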

N-gram language models with MLE and Laplace smoothing

import math

import nltk
from nltk.lm import MLE, Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt', quiet=True)  # tokeniser data needed by word_tokenize/sent_tokenize

# Training corpus
corpus_text = """
The cat sat on the mat. The cat ate the rat.
The dog sat on the log. The dog ran in the fog.
A quick brown fox jumps over the lazy dog.
The quick fox ran fast. The brown dog barked loudly.
"""

# Tokenise into sentences and words
sentences = [word_tokenize(s.lower()) for s in sent_tokenize(corpus_text) if s.strip()]

# Prepare n-gram training data (2-gram bigram model)
n = 2
train_data, padded_vocab = padded_everygram_pipeline(n, sentences)

# MLE bigram model (maximum likelihood estimation)
lm_mle = MLE(n)
lm_mle.fit(train_data, padded_vocab)

# Laplace smoothed bigram model (handles unseen n-grams)
train_data2, padded_vocab2 = padded_everygram_pipeline(n, sentences)
lm_laplace = Laplace(n)
lm_laplace.fit(train_data2, padded_vocab2)

# Compute probabilities
print("MLE P('mat'|'the') =", lm_mle.score('mat', ['the']))
print("MLE P('elephant'|'the') =", lm_mle.score('elephant', ['the']))   # 0 — never seen
print("Laplace P('elephant'|'the') =", lm_laplace.score('elephant', ['the']))  # Small but non-zero

# Perplexity: how surprised is the model by the test text? Lower = better.
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import ngrams

test_tokens = list(pad_both_ends(word_tokenize("the cat ran fast"), n=n))
test_bigrams = list(ngrams(test_tokens, n))
print(f"Laplace perplexity: {lm_laplace.perplexity(test_bigrams):.2f}")

# Perplexity formula: PP(W) = P(w1...wN)^(-1/N) = exp(-1/N * sum(log P(wi|context)))
def compute_perplexity(model, test_sentence, n):
    tokens = word_tokenize(test_sentence.lower())
    log_prob = sum(math.log(max(model.score(tokens[i], tokens[max(0,i-n+1):i]), 1e-10))
                   for i in range(1, len(tokens)))
    return math.exp(-log_prob / (len(tokens) - 1)) if len(tokens) > 1 else float('inf')

ppl = compute_perplexity(lm_laplace, "the cat sat on the mat", n)
print(f"Manual perplexity (in-domain): {ppl:.2f}")      # Should be low
ppl2 = compute_perplexity(lm_laplace, "the elephant flew over", n)
print(f"Manual perplexity (out-domain): {ppl2:.2f}")    # Should be high
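NLTK also ships a KneserNeyInterpolated class, but the core idea fits in a few lines. Below is a simplified interpolated Kneser-Ney bigram model sketched in pure Python: discounting frees probability mass that is redistributed in proportion to a word's continuation probability (how many distinct contexts it follows), not its raw frequency. The discount d is a free parameter, and this ignores the refinements of the full "modified" variant:

```python
from collections import Counter

def make_kn_bigram(tokens, d=0.75):
    """Simplified interpolated Kneser-Ney for a bigram model."""
    bg = Counter(zip(tokens, tokens[1:]))      # bigram counts C(u, w)
    ug = Counter(tokens[:-1])                  # context counts C(u)
    cont = Counter(w for (_, w) in bg)         # distinct contexts each w follows
    fanout = Counter(u for (u, _) in bg)       # distinct words that follow each u
    n_types = len(bg)                          # number of distinct bigram types

    def p_kn(w, u):
        p_cont = cont[w] / n_types             # continuation probability of w
        if ug[u] == 0:                         # unseen context: back off fully
            return p_cont
        discounted = max(bg[(u, w)] - d, 0) / ug[u]
        lam = d * fanout[u] / ug[u]            # mass freed by discounting
        return discounted + lam * p_cont

    return p_kn

p = make_kn_bigram("the cat sat on the mat".split())
vocab = {"the", "cat", "sat", "on", "mat"}
print(sum(p(w, "the") for w in vocab))         # sums to 1.0: a proper distribution
```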

Causal vs Masked Language Modelling

| Type | Training objective | Attention pattern | Models | Best for |
|------|--------------------|-------------------|--------|----------|
| Causal LM (CLM) | Predict NEXT token from left context only | Triangular mask (no future) | GPT-2/3/4, Claude, Llama, Mistral | Text generation, chatbots, code completion |
| Masked LM (MLM) | Predict MASKED tokens using full context | Full bidirectional attention | BERT, RoBERTa, ELECTRA | Text understanding, classification, NER, QA |
| Prefix LM | MLM on prefix, CLM on completion | Hybrid (full on prefix, causal on target) | T5, GLM, UnifiedLM | Seq2seq, translation, summarisation |
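The attention patterns above can be sketched directly: a causal (triangular) mask versus a full bidirectional mask. A pure-Python illustration, where True means "position i may attend to position j":

```python
def causal_mask(n):
    """Triangular mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Full mask: every position attends to every position (BERT-style)."""
    return [[True] * n for _ in range(n)]

# Visualise the causal pattern for a 4-token sequence
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```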

Causal LM (GPT) vs Masked LM (BERT) in practice

import torch
from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer

# ── CAUSAL LM: GPT-2 (predict next token, auto-regressive) ──
generator = pipeline('text-generation', model='gpt2', max_new_tokens=30)
result = generator("The capital of France is", num_return_sequences=1)
print("GPT-2 completion:", result[0]['generated_text'])
# e.g. "The capital of France is Paris, and the city has a population of..." (sampled; varies per run)

# ── MASKED LM: BERT (fill in [MASK] using bidirectional context) ──
unmasker = pipeline('fill-mask', model='bert-base-uncased')
results  = unmasker("The capital of France is [MASK].")
for r in results[:3]:
    print(f"  {r['token_str']:15} (score: {r['score']:.3f})")
# paris          (score: 0.871)
# lyon           (score: 0.041)
# marseille      (score: 0.023)

# Computing perplexity with a neural LM
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained('gpt2')
model_gpt2     = GPT2LMHeadModel.from_pretrained('gpt2')
model_gpt2.eval()

def neural_lm_perplexity(text: str, model, tokenizer) -> float:
    """Compute perplexity of text under a causal language model."""
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    return torch.exp(outputs.loss).item()  # e^(avg negative log-likelihood)

easy_text = "The dog barked at the cat"
hard_text = "The cat barked at the dog"  # Plausible but less common
random_text = "Colourless green ideas sleep furiously"  # Grammatical but nonsensical

print("\nPerplexity comparison (GPT-2):")
for text in [easy_text, hard_text, random_text]:
    ppl = neural_lm_perplexity(text, model_gpt2, tokenizer_gpt2)
    print(f"  PPL={ppl:.1f}: {text}")

Why masked LM cannot be used for generation

BERT uses bidirectional attention — each token attends to ALL other tokens including those to its right (future). At generation time, future tokens do not exist yet. BERT cannot generate token-by-token because its predictions assume the full sequence is available. GPT uses causal (unidirectional) attention — token i only attends to tokens 0..i-1. This constraint is exactly what enables auto-regressive generation.
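The causal constraint is what makes a simple decoding loop possible: each step consumes only the tokens generated so far. A sketch with a hypothetical lookup table standing in for a trained model's argmax (the table entries are invented purely for illustration):

```python
# Hypothetical argmax lookup table standing in for a trained causal LM;
# the entries below are invented for illustration only.
NEXT = {
    ("<s>",): "paris",
    ("<s>", "paris"): "is",
    ("<s>", "paris", "is"): "nice",
    ("<s>", "paris", "is", "nice"): "</s>",
}

def greedy_decode(max_len=10):
    tokens = ["<s>"]
    # Each step conditions only on the tokens generated so far (left context);
    # a masked LM has no such interface, since it expects the full sequence.
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        tokens.append(NEXT[tuple(tokens)])
    return tokens

print(greedy_decode())  # ['<s>', 'paris', 'is', 'nice', '</s>']
```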

Practice questions

  1. Perplexity of a language model on a test set decreased from 150 to 45. Is this better or worse? (Answer: Better — lower perplexity means the model is less surprised by the test text (assigns higher probability to it). Perplexity = 1 would mean the model predicts every token with certainty. As rough orders of magnitude, simple n-gram models sit in the hundreds on open-domain text, while strong modern transformer LMs reach low double digits on standard benchmarks.)
  2. Why is causal language modelling (CLM) the pre-training objective for GPT but not BERT? (Answer: GPT is auto-regressive — it generates text left-to-right, so it must be trained to predict the next token from only left context (causal). BERT is used for understanding tasks where the full sentence is available, so it uses masked LM with bidirectional attention for richer representations.)
  3. N-gram LM: P("Paris" | "is") = 0 because "is Paris" never appeared in training. What is this problem and solution? (Answer: Zero probability / sparse data problem. Any N-gram not seen in training has P=0 — multiplying by zero zeros the entire sentence probability. Solution: smoothing — Laplace (add-1), Kneser-Ney, or Good-Turing redistribute probability mass to unseen N-grams.)
  4. What does a perplexity of 1 mean for a language model? (Answer: The model has perfect knowledge — it assigns probability 1.0 to every correct next token, meaning it is never surprised. This is impossible in practice on real text; it would imply the model has memorised the test set.)
  5. T5 uses a "prefix LM" style training. What does this enable over pure CLM or MLM? (Answer: Prefix LM: apply full bidirectional attention on the input prefix (encoder-style), causal attention on the output (decoder-style). This naturally handles seq2seq tasks (translation, summarisation) where the full input is known but the output is generated auto-regressively.)
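A quick numeric check of the perplexity formula behind question 1: a model that spreads probability uniformly over V tokens has perplexity exactly V, its effective branching factor.

```python
import math

# A model that assigns uniform probability 1/V to every next token
# has perplexity V: exp(-(1/N) * sum(log(1/V))) = exp(log V) = V.
V = 50
log_probs = [math.log(1 / V)] * 20      # 20 tokens, each predicted with P = 1/V
ppl = math.exp(-sum(log_probs) / len(log_probs))
print(ppl)                              # ~50.0
```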

On LumiChats

LumiChats uses a causal language model — each response token is generated by predicting the most likely next token given all previous tokens in the conversation. Perplexity on held-out human conversations is one of the key metrics used to evaluate and improve language model quality.
