Tokenization is the process of converting raw text into a sequence of tokens — the fundamental units that language models process. Tokens are not words; they are subword units produced by algorithms like Byte Pair Encoding (BPE) or WordPiece that split text into frequent substrings. 'Hello world' might tokenize as ['Hello', ' world'] (2 tokens), while 'antidisestablishmentarianism' might tokenize as ['ant', 'idis', 'estab', 'lish', 'ment', 'arian', 'ism'] (7 tokens). Understanding tokenization is essential for predicting AI costs, understanding context window limits, and diagnosing unexpected model behaviours.
Byte Pair Encoding (BPE) — how GPT tokenizes text
BPE, used by GPT-2, GPT-3, GPT-4, and GPT-5.4, starts with a base vocabulary of all 256 individual byte values. It then iteratively merges the most frequent adjacent pair of existing vocabulary items, adding the merged pair as a new vocabulary entry. After ~50,000–100,000 merges, the result is a vocabulary of common subword units that balances vocabulary size against token-count efficiency.
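The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration of "repeatedly merge the most frequent adjacent pair", not a production tokenizer (real BPE starts from bytes, handles pre-tokenization, and stores the merge rules for later encoding):

```python
# Toy sketch of the BPE training loop: greedily merge the most frequent
# adjacent symbol pair, num_merges times.
from collections import Counter

def bpe_merges(text: str, num_merges: int):
    symbols = list(text)  # start from individual characters (bytes in real BPE)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats — no useful merge left
        merges.append(a + b)
        # rewrite the symbol sequence, treating the merged pair as one unit
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

merges, symbols = bpe_merges("low lower lowest", 4)
print(merges)   # learned subword merges, starting with 'lo' then 'low'
print(symbols)  # the text re-segmented using those merges
```

On this tiny corpus the first merges capture the shared stem 'low' — exactly the frequency-driven behaviour that, at corpus scale, produces subwords like 'ment' and 'arian'.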
Tokenizing text with tiktoken (OpenAI) and transformers (HuggingFace) — see exactly how your text splits
# OpenAI's tiktoken — for GPT-3.5, GPT-4, GPT-5.4
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding
text = "LumiChats is a pay-per-day AI platform for students."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}") # → 13 (one per split shown below)
print(f"Token IDs: {tokens}")
# Decode individual tokens to see the splits
for token_id in tokens:
    print(repr(enc.decode([token_id])))
# → 'L' 'umi' 'Chats' ' is' ' a' ' pay' '-per' '-day' ' AI' ' platform' ' for' ' students' '.'
# HuggingFace tokenizers — for Llama, Mistral, and other open-weight models
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
result = tokenizer("antidisestablishmentarianism", return_offsets_mapping=True)
print(result.tokens())
# → e.g. ['ant', 'idis', 'establish', 'ment', 'arian', 'ism'] plus a BOS token — exact splits vary by model vocabulary
# Cost estimation: many APIs charge on the order of $0.001 per 1,000 tokens
text = "This is a sample document for cost estimation. " * 12 # ~96 words
estimated_tokens = len(text) / 4 # rough rule: 1 token ≈ 4 characters in English
print(f"Estimated tokens: {estimated_tokens:.0f}")

| Text | GPT-4o tokens | Notes |
|---|---|---|
| Hello, world! | 4 | Common words tokenize efficiently |
| antidisestablishmentarianism | 6 | Rare word = many subword tokens |
| 1+1=2 | 5 | Math expressions are tokenized character-by-character |
| राम (Hindi: Ram) | 4–8 | Non-Latin scripts are less efficient — more tokens per word |
| def fibonacci(n): | 6 | Code keywords are common — efficient tokenization |
| aaaaaaa | 3–7 | Repetition breaks into subword chunks, not single tokens |
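One way to read the table above is through characters per token — a quick efficiency measure. The arithmetic below uses the table's approximate token counts (not freshly measured values) to show why math expressions are so much more token-hungry than common words:

```python
# Characters per token for rows of the table above; token counts are the
# table's approximate figures, hard-coded here for illustration.
rows = [
    ("Hello, world!", 4),
    ("antidisestablishmentarianism", 6),
    ("1+1=2", 5),
    ("def fibonacci(n):", 6),
]
for text, n_tokens in rows:
    ratio = len(text) / n_tokens
    print(f"{text!r}: {len(text)} chars / {n_tokens} tokens = {ratio:.2f} chars/token")
```

'1+1=2' lands at 1.0 character per token — near character-by-character encoding — while ordinary prose sits around 3–4, which is where the "1 token ≈ 4 characters" rule of thumb comes from.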
Why tokenization matters for users
- API cost: You pay per token, not per word or character. Knowing that 1 token ≈ 4 characters in English (but 1–2 characters in Hindi, Tamil, or Arabic) helps estimate costs for non-English content.
- Context window: A model with a 200K token context window can process roughly 150,000 words of English — or significantly less in token-rich languages. Context limits are always stated in tokens.
- Arithmetic and spelling: LLMs sometimes struggle with character-level tasks because they see tokens, not characters. 'How many R's are in strawberry?' is hard because 'strawberry' may tokenize as ['str', 'awb', 'erry'] — the model must reason about token internals to count characters.
- Non-English efficiency: English text averages ~4 characters per token (about 1.3 tokens per word). Hindi, Arabic, and Chinese often average only 1–2 characters per token — making the same content 2–3× more expensive and context-window-hungry than English.
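The cost and context implications above can be turned into a back-of-envelope estimator. The characters-per-token ratios come from the figures quoted in this section; the price per 1,000 tokens is a placeholder assumption, not any vendor's actual rate:

```python
# Back-of-envelope token/cost estimator. Ratios follow this section's
# figures (~4 chars/token for English, ~1–2 for Hindi/Arabic/Chinese);
# the default price is a placeholder, not a real vendor rate.
CHARS_PER_TOKEN = {"english": 4.0, "hindi": 1.5, "arabic": 1.5, "chinese": 1.5}

def estimate(text_chars: int, language: str, price_per_1k_tokens: float = 0.001):
    tokens = text_chars / CHARS_PER_TOKEN[language]
    return round(tokens), tokens * price_per_1k_tokens / 1000

# The same 10,000-character document in English vs Hindi:
for lang in ("english", "hindi"):
    tokens, cost = estimate(10_000, lang)
    print(f"{lang}: ~{tokens} tokens, ~${cost:.4f}")
```

The English document comes out at ~2,500 tokens and the Hindi one at ~6,700 — the roughly 2–3× "tokenizer tax" described above.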
Checking token counts before sending
For any application where cost matters, count tokens before sending to the API. OpenAI's tiktoken library (pip install tiktoken) counts tokens for GPT models in milliseconds. For Anthropic's Claude, use the count_tokens API endpoint. For HuggingFace models, use len(tokenizer.encode(text)). Building token counting into your prompt-construction code prevents unexpected cost overruns in production.
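The check described above can be sketched as a small budget function. To keep the example self-contained it uses a stand-in word-level "tokenizer" (clearly not a real BPE); in practice you would pass tiktoken's enc.encode and enc.decode, or a HuggingFace tokenizer's equivalents:

```python
# Sketch of a pre-send token-budget check. In production, pass real
# encode/decode functions (e.g. tiktoken's enc.encode / enc.decode).
def fit_to_budget(text, max_tokens, encode, decode):
    """Truncate text to at most max_tokens tokens before sending it."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens)
    return decode(tokens[:max_tokens]), max_tokens

encode = str.split   # stand-in: one token per word, for demo only
decode = " ".join
prompt = "one two three four five six"
trimmed, n = fit_to_budget(prompt, 4, encode, decode)
print(trimmed, n)  # → one two three four 4
```

Truncating in token space (rather than slicing characters) guarantees the result never exceeds the budget, whatever the tokenizer's splits look like.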
Practice questions
- Why does the GPT-4 tokeniser encode 'Hello World' differently from 'hello world'? (Answer: BPE vocabularies are case-sensitive — uppercase and lowercase are different byte sequences. 'Hello' and 'hello' are different tokens (different IDs). This is why LLM costs and context windows are counted in tokens, not characters, and why changing capitalisation can affect token count.)
- What is the 'tokenizer tax' for non-English languages and why does it exist? (Answer: BPE vocabularies are built from training corpora that are predominantly English. Non-English text uses tokens less efficiently: 'Hello' = 1 token in GPT-4; the Japanese equivalent uses 2–3 tokens; Arabic characters may use 1–4 tokens each. This means the same information requires more tokens in non-English text, increasing costs and reducing effective context length for multilingual users. Multilingual models like mT5 mitigate this with vocabularies balanced across many languages.)
- What is the 'lost in the middle' problem in tokenisation? (Answer: This refers to attention, not tokenisation directly. However, tokenisation affects it: important tokens in the middle of a very long context receive less attention weight. For a 128K context window, a crucial fact buried at position 64K receives less attention than facts at the start or end. This is a limitation of transformer attention patterns and affects retrieval from long contexts.)
- A user's prompt has 500 words. Why might it use 700 tokens instead of 500? (Answer: The ~1.3 tokens-per-word English average accounts for: compound words split into subword tokens (running → run + ning), punctuation as separate tokens, spaces encoded as part of the following token (the Ġ prefix in GPT tokenisers), numbers split digit-by-digit, and special characters as multi-token sequences. Code and technical text often tokenises worse than prose: variable names, brackets, and operators each become separate tokens.)
- Why would you want to control tokenisation when building an NLP system? (Answer: (1) Cost control: fewer tokens = lower API costs. Prompt compression techniques remove tokens redundant for the model. (2) Context fitting: staying within context window limits. (3) Special token handling: ensuring [CLS], [SEP], or chat template tokens are added correctly. (4) Consistency: same prompt with different whitespace/punctuation can tokenise differently, causing unexpected model behaviour. Production systems normalise input text before tokenisation for consistency.)
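The normalisation point in the last question can be demonstrated with a stdlib sketch: two prompts that look identical to a human but differ in whitespace produce different byte sequences — and therefore potentially different token sequences — until they are normalised:

```python
# Sketch of input normalisation before tokenisation: collapse runs of
# whitespace and strip the ends so visually identical prompts produce
# identical strings (and hence identical token sequences).
import re

def normalise(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

a = "Summarise  this document.\n"   # double space + trailing newline
b = "Summarise this document."
assert a != b                       # raw strings differ
assert normalise(a) == normalise(b) # normalised forms are identical
print(normalise(a))
```

Running normalisation as the first step of prompt construction is a cheap way to make token counts (and cached results keyed on prompts) reproducible.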