Tokenisation splits raw text into discrete units called tokens. Word tokenisation splits on whitespace and punctuation. Subword tokenisation (BPE, WordPiece, SentencePiece) splits words into smaller fragments — enabling models to handle rare words, morphology, and multilingual text with a fixed vocabulary. Every modern LLM (BERT, GPT-4, Claude, Llama) uses subword tokenisation. Understanding tokenisation explains why LLMs count tokens rather than words, why the spellings 'tokenisation' and 'tokenization' can be split into different token sequences, and how vocabulary size affects model capacity.
Real-life analogy: The alphabet vs syllables vs words
Word tokenisation: each word is a unit. Problem: 'running', 'runner', 'runs' are treated as completely different tokens — the model must learn each separately. Subword tokenisation: 'running' → ['run', '##ning']. The model learns 'run' once and reuses it for running, runner, runs. Unknown words? 'transformative' → ['transform', '##ative'] — both known subwords. Subword is to words what syllables are to reading — smaller units that compose into anything.
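The analogy can be made concrete with a minimal sketch of the greedy longest-match splitting that WordPiece-style tokenisers use. The tiny vocabulary here is hypothetical — real tokenisers learn tens of thousands of entries:

```python
# Toy greedy longest-match subword splitter (the idea behind WordPiece).
# VOCAB is a made-up illustration, not a real tokeniser's vocabulary.
VOCAB = {"run", "##ning", "##ner", "##s", "transform", "##ative", "un", "##happy"}

def wordpiece_split(word: str, vocab: set) -> list:
    """Greedily match the longest known prefix, then continue with '##' pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no known piece at all → unknown word
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece_split("running", VOCAB))         # ['run', '##ning']
print(wordpiece_split("transformative", VOCAB))  # ['transform', '##ative']
```

Because 'run' is matched once and reused, 'runner' and 'runs' cost no extra vocabulary entries — only the short continuation pieces do.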
Word tokenisation and its limitations
Word, sentence, and character tokenisation
import re
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer
text = "Dr. Smith won't stop! He's running 5km/day. #fitness @coach"
# Word tokenisation (NLTK)
word_tokens = word_tokenize(text)
print("Word tokens:", word_tokens)
# ['Dr.', 'Smith', 'wo', "n't", 'stop', '!', 'He', "'s", 'running', ...]
# Sentence tokenisation
sent_tokens = sent_tokenize("Hello World. How are you? I am fine.")
print("Sentences:", sent_tokens)
# Tweet tokeniser (preserves hashtags, mentions)
tweet_tok = TweetTokenizer()
tweet_tokens = tweet_tok.tokenize(text)
print("Tweet tokens:", tweet_tokens)
# ['Dr.', 'Smith', "won't", 'stop', '!', "He's", 'running', '5km/day', '#fitness', '@coach']
# Character tokenisation (rare — used for Chinese, code generation)
char_tokens = list("hello") # ['h', 'e', 'l', 'l', 'o']
# Problems with word tokenisation:
print("\nVocabulary problems:")
words = ["run", "running", "runner", "runs", "ran",  # Same root, 5 tokens
         "colour", "color",                          # British/American spelling
         "ChatGPT", "transformers", "BERT",          # New words → unknown
         "don't", "won't", "can't"]                  # Contractions split inconsistently
BPE — Byte Pair Encoding
BPE (Sennrich et al., 2016) starts with a character vocabulary and iteratively merges the most frequent adjacent pairs. The merge table is learned from the corpus and applied to new text. Used by GPT-2, GPT-3, GPT-4, Llama, Mistral, and most modern LLMs.
BPE algorithm from scratch and tiktoken
import re
from collections import Counter, defaultdict

def train_bpe(corpus: list, num_merges: int):
    """
    Train a BPE tokeniser on a corpus of words.
    Returns the final vocabulary and the ordered list of merge rules.
    """
    # Initialise: represent each word as space-separated characters + end-of-word marker
    vocab = {' '.join(list(word) + ['</w>']): freq
             for word, freq in Counter(corpus).items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs in the current vocabulary
        pairs = defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        if not pairs:
            break
        # Find the most frequent pair
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge all occurrences of the best pair — lookarounds ensure we only
        # match whole symbols, never fragments of a larger merged symbol
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        vocab = {pattern.sub(''.join(best), word): freq
                 for word, freq in vocab.items()}
    return vocab, merges
corpus = ['low', 'lower', 'newest', 'widest', 'wider', 'new', 'newer']
vocab, merges = train_bpe(corpus, num_merges=10)
print("Learned merges:", merges[:5])
# e.g. [('w', 'e'), ...] — the exact order depends on tie-breaking among equally frequent pairs
print("Final vocabulary:", list(vocab.keys())[:5])
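The learned merge table can then encode words never seen during training — the point raised in the practice questions later. The `bpe_encode` helper below is a sketch, and `demo_merges` is an illustrative merge table rather than the exact output of `train_bpe` above:

```python
def bpe_encode(word: str, merges: list) -> list:
    """Apply learned merge rules, in training order, to a new word."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            # Merge every adjacent (a, b) occurrence in one left-to-right pass
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table of the kind train_bpe returns
demo_merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
print(bpe_encode("lowest", demo_merges))  # ['low', 'est</w>']
```

'lowest' never appeared in the toy corpus, yet it still tokenises cleanly into two known subwords — in the worst case a word simply falls back to more, smaller pieces.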
# ── Production: tiktoken (OpenAI tokeniser) ──
import tiktoken
# GPT-4 tokeniser
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding
texts = ["tokenization", "tokenisation", "I'm running fast",
"Hello, World!", "ChatGPT is amazing!!!"]
for text in texts:
    tokens = enc.encode(text)
    decoded = [enc.decode([t]) for t in tokens]
    print(f"{text!r:40} → {len(tokens)} tokens: {decoded}")
WordPiece and SentencePiece
| Method | Used by | Merge criterion | OOV handling | Vocabulary mark |
|---|---|---|---|---|
| BPE | GPT-2/3/4, Llama, Mistral, RoBERTa | Most frequent adjacent pair | Falls back to character | No special mark |
| WordPiece | BERT, DistilBERT, ELECTRA | Maximises likelihood of training data | Uses [UNK] token | ## prefix on continuation |
| SentencePiece | T5, ALBERT, LLaMA-2, mT5 | BPE or unigram LM on raw text | Handles any Unicode | ▁ prefix on word starts |
| Unigram LM | ALBERT, T5 (via SentencePiece) | Probabilistic — prune low-likelihood tokens | Robust | Varies |
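The unigram-LM row deserves a small illustration: rather than applying merges, it scores every possible segmentation by the sum of its tokens' log-probabilities and keeps the best one via dynamic programming. The probabilities below are invented for illustration — a real model estimates them with EM over a large corpus:

```python
import math

# Hypothetical unigram probabilities; a real model learns these via EM
PROBS = {"un": 0.05, "happy": 0.04, "unhappy": 0.0005, "h": 0.001,
         "a": 0.002, "p": 0.001, "y": 0.001, "u": 0.001, "n": 0.002}

def best_segmentation(word: str) -> tuple:
    """Viterbi over all segmentations: maximise the sum of log P(token)."""
    n = len(word)
    best = [(-math.inf, [])] * (n + 1)   # best[i] = (score, tokens) for word[:i]
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in PROBS and best[start][0] != -math.inf:
                score = best[start][0] + math.log(PROBS[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n]

score, tokens = best_segmentation("unhappy")
print(tokens)  # ['un', 'happy'] — here the split outscores the single token
```

With these numbers, P('un')·P('happy') = 0.002 beats P('unhappy') = 0.0005, so the probabilistic criterion prefers the morphologically sensible split — no merge table required.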
WordPiece tokenisation with BERT tokeniser
from transformers import BertTokenizer, GPT2Tokenizer
bert_tok = BertTokenizer.from_pretrained('bert-base-uncased')
gpt2_tok = GPT2Tokenizer.from_pretrained('gpt2')
text = "The tokenization of 'unhappiness' demonstrates subword power."
bert_tokens = bert_tok.tokenize(text)
gpt2_tokens = gpt2_tok.tokenize(text)
print("BERT WordPiece:", bert_tokens)
# ['the', 'token', '##ization', 'of', "'", 'un', '##happiness', ...]
# Note: ## marks continuation sub-tokens
print("GPT-2 BPE: ", gpt2_tokens)
# ['The', 'Ġtoken', 'ization', 'Ġof', "Ġ'", 'unh', 'appiness', ...]
# Note: Ġ marks space before word
# Vocabulary sizes
print(f"BERT vocab size: {bert_tok.vocab_size}") # 30,522
print(f"GPT-2 vocab size: {gpt2_tok.vocab_size}") # 50,257
# Full encoding with special tokens
encoded = bert_tok("Hello world!", return_tensors="pt")
print(f"\nBERT input_ids: {encoded['input_ids']}")
# [101 (CLS), 7592 (Hello), 2088 (world), 999 (!), 102 (SEP)]
# Token count vs word count
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
sentence = "The quick brown fox jumps over the lazy dog"
print(f"Words: {len(sentence.split())}, Tokens: {len(enc.encode(sentence))}")
# 9 words; common English words are usually one token each (average ratio ~1.3)
Why token count matters for LLMs
LLMs have a context window measured in TOKENS, not words. GPT-4 Turbo supports a 128K-token context. English text averages ~1.3 tokens per word, so 1,000 words ≈ 1,300 tokens. Code and non-English text use more tokens per character. This is why LLM APIs charge per token, and why prompts must be concise. A single emoji can be 1-3 tokens; Chinese characters are often 1-2 tokens each.
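A back-of-envelope budget makes this concrete. The per-token price below is purely hypothetical — check your provider's current rates:

```python
# Back-of-envelope token budgeting for English prose.
# PRICE_PER_MTOK is a made-up figure, not any provider's actual pricing.
TOKENS_PER_WORD = 1.3          # rough average for English text
PRICE_PER_MTOK = 3.00          # hypothetical USD per 1M input tokens

def estimate(words: int) -> tuple:
    """Estimate token count and input cost for a given word count."""
    tokens = round(words * TOKENS_PER_WORD)
    cost = tokens / 1_000_000 * PRICE_PER_MTOK
    return tokens, cost

tokens, cost = estimate(100_000)   # e.g. a 100k-word manuscript
print(f"~{tokens:,} tokens, ~${cost:.2f} at the assumed rate")
# ≈130,000 tokens — roughly a full 128K context window
```

The same arithmetic explains why a prompt trimmed from 1,000 to 500 words halves both its cost and the share of the context window it consumes.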
Practice questions
- Why does BERT use WordPiece with 30,000 tokens instead of full word vocabulary? (Answer: A full word vocabulary for English needs 500k+ tokens — most appear rarely, requiring huge embedding matrices and poor OOV handling. 30k WordPiece tokens cover ~95% of real text while keeping the model trainable. Rare words are split into known subwords rather than marked as [UNK].)
- BERT tokenises "unhappiness" as ["un", "##happiness"]. What does ## mean? (Answer: ## marks a continuation subword — it does NOT start a new word. "un" starts the word; "##happiness" continues it. This allows the model to recognise "un-" as a prefix across many words (unkind, unfair, unfit) and "happiness" as a standalone word.)
- A text has 100 words. Approximately how many GPT-4 tokens is it? (Answer: English averages ~1.3 tokens/word. 100 words ≈ 130 tokens. But it depends on vocabulary richness — common words like "the", "a", "is" are 1 token each; technical terms may be split into 2-4 tokens.)
- What is the main advantage of SentencePiece over word-based tokenisers for multilingual models? (Answer: SentencePiece treats text as a raw character stream with no language-specific preprocessing — no need for language-specific rules about word boundaries, spaces, or scripts. It works identically for English, Chinese, Arabic, and Japanese, and is used in mT5, XLM-R, and multilingual Llama models.)
- BPE merges "l o w </w>" into "low</w>". What happens if "low" appears in test data but was never in the training corpus? (Answer: BPE applies the learned merge rules sequentially to the character sequence l, o, w, </w>. Even if "low" was never seen as a complete word, the individual characters and pairs are in the vocabulary. The model can still tokenise it, just possibly with more splits.)
On LumiChats
Every message you send to LumiChats is first tokenised using a BPE-like subword tokeniser before being processed. The context window limit, pricing, and generation speed all depend on token count, not word count. LumiChats can show you the token breakdown of any text you provide.
Try it free