
Tokenisation — Word, Subword, BPE & WordPiece

Step zero of every NLP pipeline — splitting text into units a model can process.


Definition

Tokenisation splits raw text into discrete units called tokens. Word tokenisation splits on whitespace and punctuation. Subword tokenisation (BPE, WordPiece, SentencePiece) splits words into smaller fragments — enabling models to handle rare words, morphology, and multilingual text with a fixed vocabulary. Every modern LLM (BERT, GPT-4, Claude, Llama) uses subword tokenisation. Understanding tokenisation explains why LLMs count tokens, not words, why 'tokenisation' and 'tokenization' can produce different token sequences, and how vocabulary size affects model capacity.

Real-life analogy: The alphabet vs syllables vs words

Word tokenisation: each word is a unit. Problem: 'running', 'runner', 'runs' are treated as completely different tokens — the model must learn each separately. Subword tokenisation: 'running' → ['run', '##ning']. The model learns 'run' once and reuses it for running, runner, runs. Unknown words? 'transformative' → ['transform', '##ative'] — both known subwords. Subword is to words what syllables are to reading — smaller units that compose into anything.
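The 'run' + '##ning' split above can be reproduced with a greedy longest-match lookup — a minimal sketch of WordPiece-style inference, assuming a tiny hand-picked vocabulary (real vocabularies hold ~30k entries):

```python
def greedy_subword(word, vocab):
    """Greedily match the longest known prefix, then continue with
    '##'-prefixed continuation pieces (WordPiece-style inference)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces after the first carry the '##' continuation mark
            piece = word[start:end] if start == 0 else '##' + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ['[UNK]']   # no known piece matches at this position
    return pieces

# Toy vocabulary for illustration only
vocab = {'run', '##ning', '##ner', '##s', 'transform', '##ative'}
print(greedy_subword('running', vocab))         # ['run', '##ning']
print(greedy_subword('transformative', vocab))  # ['transform', '##ative']
```

The same stem 'run' is reused across all its inflections, which is exactly what word-level vocabularies cannot do.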

Word tokenisation and its limitations

Word, sentence, and character tokenisation

import re
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer

text = "Dr. Smith won't stop! He's running 5km/day. #fitness @coach"

# Word tokenisation (NLTK)
word_tokens = word_tokenize(text)
print("Word tokens:", word_tokens)
# ['Dr.', 'Smith', 'wo', "n't", 'stop', '!', 'He', "'s", 'running', ...]

# Sentence tokenisation
sent_tokens = sent_tokenize("Hello World. How are you? I am fine.")
print("Sentences:", sent_tokens)

# Tweet tokeniser (preserves hashtags, mentions)
tweet_tok = TweetTokenizer()
tweet_tokens = tweet_tok.tokenize(text)
print("Tweet tokens:", tweet_tokens)
# ['Dr.', 'Smith', "won't", 'stop', '!', "He's", 'running', '5km/day', '#fitness', '@coach']

# Character tokenisation (rare — used for Chinese, code generation)
char_tokens = list("hello")   # ['h', 'e', 'l', 'l', 'o']

# Problems with word tokenisation:
print("\nVocabulary problems:")
words = ["run", "running", "runner", "runs", "ran",      # Same root, 5 tokens
         "colour", "color",                               # British/American spelling
         "ChatGPT", "transformers", "BERT",              # New words → unknown
         "don't", "won't", "can't"]                      # Contractions split inconsistently
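The out-of-vocabulary problem can be shown directly with a toy word-level vocabulary built from a fixed "training" set — a sketch not tied to any particular library:

```python
# Toy word-level vocabulary learned from training text
train_words = {'the', 'cat', 'runs', 'fast'}

def word_lookup(tokens, vocab):
    # Any word not seen during training collapses to a single [UNK],
    # losing all information about the original word
    return [t if t in vocab else '[UNK]' for t in tokens]

print(word_lookup(['the', 'cat', 'runs'], train_words))
# ['the', 'cat', 'runs']
print(word_lookup(['the', 'dog', 'ran'], train_words))
# ['the', '[UNK]', '[UNK]']
```

Subword tokenisers avoid this collapse: an unseen word is split into known fragments instead of being replaced wholesale.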

BPE — Byte Pair Encoding

BPE (Sennrich et al., 2016) starts with a character vocabulary and iteratively merges the most frequent adjacent pairs. The merge table is learned from the corpus and applied to new text. Used by GPT-2, GPT-3, GPT-4, Llama, Mistral, and most modern LLMs.

BPE algorithm from scratch and tiktoken

import re
from collections import defaultdict

def train_bpe(corpus: list, num_merges: int):
    """
    Train a BPE tokeniser on a corpus.
    Returns the final vocabulary and the ordered list of merge rules.
    """
    # Initialise: represent each word as characters + end-of-word marker
    vocab = {}
    for word in corpus:
        chars = list(word) + ['</w>']
        vocab[' '.join(chars)] = corpus.count(word)

    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs in the current vocabulary
        pairs = defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i+1])] += freq

        if not pairs: break

        # Find most frequent pair
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Merge all occurrences of best pair in vocabulary
        new_vocab = {}
        bigram = re.escape(' '.join(best))
        for word in vocab:
            new_word = re.sub(bigram, ''.join(best), word)
            new_vocab[new_word] = vocab[word]
        vocab = new_vocab

    return vocab, merges

corpus = ['low', 'lower', 'newest', 'widest', 'wider', 'new', 'newer']
vocab, merges = train_bpe(corpus, num_merges=10)
print("Learned merges:", merges[:5])
# e.g. [('w', 'e'), ('r', '</w>'), ('n', 'e'), ...] — most frequent pairs merge first
print("Final vocabulary:", list(vocab.keys())[:5])
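At inference time, BPE simply replays the learned merges, in order, on a new word's character sequence. A minimal sketch, using a small hand-written merge table rather than the one learned above:

```python
def bpe_encode(word, merges):
    """Apply learned BPE merge rules, in training order, to a new word."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                # Merge the adjacent pair; do not advance i, so repeated
                # merges at the same position are caught
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hypothetical merge table — order matters (earliest = most frequent)
merges = [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(bpe_encode('lowest', merges))   # ['low', 'est', '</w>']
print(bpe_encode('low', merges))      # ['low', '</w>']
```

'lowest' was never in the merge-table training data as a whole word, yet it tokenises into two meaningful pieces — this is the behaviour practice question 5 below asks about.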

# ── Production: tiktoken (OpenAI tokeniser) ──
import tiktoken

# GPT-4 tokeniser
enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 encoding

texts = ["tokenization", "tokenisation", "I'm running fast",
         "Hello, World!", "ChatGPT is amazing!!!"]

for text in texts:
    tokens = enc.encode(text)
    decoded = [enc.decode([t]) for t in tokens]
    print(f"{text!r:40} → {len(tokens)} tokens: {decoded}")

WordPiece and SentencePiece

| Method | Used by | Merge criterion | OOV handling | Vocabulary mark |
|---|---|---|---|---|
| BPE | GPT-2/3/4, Llama, Mistral, RoBERTa | Most frequent adjacent pair | Falls back to characters | No special mark |
| WordPiece | BERT, DistilBERT, ELECTRA | Maximises likelihood of training data | Uses [UNK] token | ## prefix on continuations |
| SentencePiece | T5, ALBERT, Llama 2, mT5 | BPE or unigram LM on raw text | Handles any Unicode | ▁ prefix on word starts |
| Unigram LM | ALBERT, T5 (via SentencePiece) | Probabilistic — prune low-likelihood tokens | Robust | Varies |
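The marker conventions in the last column determine how tokens are glued back into text. A sketch of the three detokenisation rules, with illustrative token lists:

```python
def detok_wordpiece(tokens):
    # '##' marks a continuation: strip it and glue to the previous piece
    text = ''
    for t in tokens:
        if t.startswith('##'):
            text += t[2:]
        else:
            text += (' ' if text else '') + t
    return text

def detok_gpt2(tokens):
    # 'Ġ' marks a leading space on the token
    return ''.join(t.replace('Ġ', ' ') for t in tokens).strip()

def detok_sentencepiece(tokens):
    # '▁' marks a word start; spaces are part of the tokens themselves
    return ''.join(tokens).replace('▁', ' ').strip()

print(detok_wordpiece(['token', '##ization']))           # tokenization
print(detok_gpt2(['The', 'Ġtoken', 'ization']))          # The tokenization
print(detok_sentencepiece(['▁The', '▁token', 'ization']))  # The tokenization
```

Note the design difference: WordPiece marks continuations, while GPT-2 BPE and SentencePiece mark word starts, which lets them round-trip whitespace losslessly.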

WordPiece tokenisation with BERT tokeniser

from transformers import BertTokenizer, GPT2Tokenizer, T5Tokenizer

bert_tok = BertTokenizer.from_pretrained('bert-base-uncased')
gpt2_tok = GPT2Tokenizer.from_pretrained('gpt2')

text = "The tokenization of 'unhappiness' demonstrates subword power."

bert_tokens = bert_tok.tokenize(text)
gpt2_tokens = gpt2_tok.tokenize(text)

print("BERT WordPiece:", bert_tokens)
# ['the', 'token', '##ization', 'of', "'", 'un', '##happiness', ...]
# Note: ## marks continuation sub-tokens

print("GPT-2 BPE:     ", gpt2_tokens)
# ['The', 'Ġtoken', 'ization', 'Ġof', "Ġ'", 'unh', 'appiness', ...]
# Note: Ġ marks space before word

# Vocabulary sizes
print(f"BERT vocab size: {bert_tok.vocab_size}")    # 30,522
print(f"GPT-2 vocab size: {gpt2_tok.vocab_size}")   # 50,257

# Full encoding with special tokens
encoded = bert_tok("Hello world!", return_tensors="pt")
print(f"\nBERT input_ids: {encoded['input_ids']}")
# [101 (CLS), 7592 (Hello), 2088 (world), 999 (!), 102 (SEP)]

# Token count vs word count
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
sentence = "The quick brown fox jumps over the lazy dog"
print(f"Words: {len(sentence.split())}, Tokens: {len(enc.encode(sentence))}")
# Words: 9 — common English words are usually one token each;
# typical English prose averages ~1.3 tokens per word overall

Why token count matters for LLMs

LLMs have a context window measured in tokens, not words. GPT-4 Turbo offers a 128K-token context window. English text averages ~1.3 tokens per word, so 1,000 words ≈ 1,300 tokens. Code and non-English text typically use more tokens per character. This is why LLM APIs charge per token, and why prompts must be concise. A single emoji can be 1-3 tokens; Chinese characters are often 1-2 tokens each.
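The back-of-envelope arithmetic above can be wrapped in a small estimator. Both the 1.3 tokens/word ratio and the per-token price below are illustrative assumptions, not real API rates:

```python
def estimate_tokens(word_count, tokens_per_word=1.3):
    """Rough token estimate for English prose (~1.3 tokens/word)."""
    return round(word_count * tokens_per_word)

def estimate_cost(token_count, price_per_1k_tokens):
    """Cost at a given per-1K-token price (hypothetical rate)."""
    return token_count / 1000 * price_per_1k_tokens

tokens = estimate_tokens(1000)        # 1,000 words ≈ 1,300 tokens
cost = estimate_cost(tokens, 0.01)    # assumed $0.01 per 1K tokens
print(tokens, f"${cost:.4f}")         # 1300 $0.0130
```

For billing or hard context limits, always count with the model's real tokeniser (e.g. tiktoken, as shown above) rather than estimating.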

Practice questions

  1. Why does BERT use WordPiece with 30,000 tokens instead of full word vocabulary? (Answer: A full word vocabulary for English needs 500k+ tokens — most appear rarely, requiring huge embedding matrices and poor OOV handling. 30k WordPiece tokens cover ~95% of real text while keeping the model trainable. Rare words are split into known subwords rather than marked as [UNK].)
  2. BERT tokenises "unhappiness" as ["un", "##happiness"]. What does ## mean? (Answer: ## marks a continuation subword — it does NOT start a new word. "un" starts the word; "##happiness" continues it. This allows the model to recognise "un-" as a prefix across many words (unkind, unfair, unfit) and "happiness" as a standalone word.)
  3. A text has 100 words. Approximately how many GPT-4 tokens is it? (Answer: English averages ~1.3 tokens/word. 100 words ≈ 130 tokens. But it depends on vocabulary richness — common words like "the", "a", "is" are 1 token each; technical terms may be split into 2-4 tokens.)
  4. What is the main advantage of SentencePiece over word-based tokenisers for multilingual models? (Answer: SentencePiece treats text as a raw byte stream with no language-specific preprocessing — no need for language-specific rules about word boundaries, spaces, or scripts. Works identically for English, Chinese, Arabic, Japanese. Used in mT5, XLM-R, and multilingual Llama models.)
  5. BPE merges "l o w </w>" into "low </w>". What happens if "low" appears in test data but was never in the training corpus? (Answer: BPE applies the learned merge rules sequentially to the character sequence l, o, w, </w>. Even if "low" was never seen as a complete word, the individual characters and pairs are in the vocabulary, so the model can still tokenise it, just possibly with more splits.)

On LumiChats

Every message you send to LumiChats is first tokenised using a BPE-like subword tokeniser before being processed. The context window limit, pricing, and generation speed all depend on token count, not word count. LumiChats can show you the token breakdown of any text you provide.

