Machine learning models cannot process raw text — they require numerical input. Text representation methods convert documents into fixed-length numerical vectors. Classical approaches include One-Hot Encoding, Bag-of-Words (BoW), TF-IDF, and N-gram models. These methods form the foundation of traditional NLP and are still widely used in production systems for search, spam detection, and text classification where interpretability and speed matter more than deep contextual understanding.
Real-life analogy: The word frequency ledger
Imagine a library cataloguing books by counting how many times each word appears. A book about cooking uses 'flour', 'bake', 'oven' many times. A book about machine learning uses 'gradient', 'loss', 'epoch'. Bag-of-Words is exactly this ledger — a document is just a column of word counts. TF-IDF refines it: common words like 'the' that appear in every book get their counts downweighted because they are not distinctive.
One-Hot Encoding and Bag-of-Words
One-Hot Encoding: Each word in the vocabulary V is represented as a vector of size |V| with a 1 at its index and 0 everywhere else. 'cat' in a vocabulary of 10,000 words → a vector of 9,999 zeros and a single 1. Problem: one-hot vectors cannot capture similarity — 'cat' and 'kitten' are exactly as far apart as 'cat' and 'aeroplane'.
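A minimal sketch of one-hot encoding with a toy three-word vocabulary (the words and indices are purely illustrative). The dot product between any two distinct one-hot vectors is 0, which is precisely the "no similarity signal" problem:

```python
import numpy as np

vocab = ["aeroplane", "cat", "kitten"]              # toy vocabulary, |V| = 3
index = {word: i for i, word in enumerate(vocab)}   # word -> position

def one_hot(word):
    """Return a |V|-dimensional vector with a 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

print(one_hot("cat"))                          # [0 1 0]
# Every pair of distinct words is equally dissimilar:
print(one_hot("cat") @ one_hot("kitten"))      # 0
print(one_hot("cat") @ one_hot("aeroplane"))   # 0
```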
Bag-of-Words (BoW): A document is represented as the sum of its token one-hot vectors — equivalently, a vector of word counts. Word order is completely ignored ('dog bites man' == 'man bites dog' in BoW). Vocabulary size is typically pruned to the top 10,000–50,000 most frequent words.
Bag-of-Words with sklearn CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"the cat sat on the mat",
"the dog sat on the log",
"cats and dogs are friends",
]
# CountVectorizer builds vocab and transforms docs to BoW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
# ['and', 'are', 'cat', 'cats', 'dog', 'dogs', 'friends', 'log', 'mat', 'on', 'sat', 'the']
print("BoW matrix:\n", X.toarray())
# Doc 1: [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 2]
# Doc 2: [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 2]
# Doc 3: [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0]
TF-IDF — weighing importance
TF-IDF (Term Frequency–Inverse Document Frequency) downweights common words that appear in many documents and upweights words that are distinctive to specific documents.
TF = frequency of term t in document d, normalised by total terms. IDF = log of total documents N divided by number of documents containing t. Words appearing in every document get IDF ≈ 0.
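The textbook formulas above can be computed by hand with a toy corpus (the documents here are illustrative). Note that sklearn's TfidfVectorizer uses a smoothed IDF plus L2 normalisation by default, so its numbers differ from this plain-definition version:

```python
import math

# Toy corpus: 'the' appears in every document, 'cat' in only one
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]
N = len(docs)

def tf(term, doc):
    """Term frequency, normalised by document length."""
    return doc.count(term) / len(doc)

def idf(term):
    """log(N / df), where df = number of documents containing the term."""
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

print(idf("the"))                                  # log(3/3) = 0.0
print(round(idf("cat"), 3))                        # log(3/1) ≈ 1.099
print(round(tf("cat", docs[0]) * idf("cat"), 3))   # TF-IDF of 'cat' in doc 0 ≈ 0.366
```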
TF-IDF with sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
"machine learning is great",
"deep learning is a subset of machine learning",
"natural language processing uses machine learning",
"deep learning powers natural language processing",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
# Display as readable DataFrame
df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
print(df.round(3))
# 'is' and 'learning' appear in many docs → lower TF-IDF
# 'natural', 'language', 'processing' more distinctive → higher TF-IDF
N-gram language models
An N-gram is a contiguous sequence of N tokens. Unigrams (N=1) = individual words. Bigrams (N=2) = pairs. Trigrams (N=3) = triples. N-gram language models estimate P(word | previous N-1 words) from corpus frequency counts — they capture local word order that BoW ignores.
N-gram Markov assumption: the probability of the next word depends only on the last N-1 words, estimated as P(w | history) ≈ C(history, w) / C(history), where C(·) is a count in the training corpus. Smoothing (Laplace, Kneser-Ney) handles zero-count N-grams.
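A minimal sketch of an unsmoothed (maximum-likelihood) bigram model estimated from counts on a toy corpus. The corpus and probabilities are illustrative only:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count unigram histories and bigram sequences
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """MLE bigram estimate: P(word | prev) = C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# 'the' is followed by 'cat' twice and 'mat' once:
print(round(p("cat", "the"), 3))   # 2/3 ≈ 0.667
print(round(p("mat", "the"), 3))   # 1/3 ≈ 0.333
```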
| Method | Captures word order? | Handles synonyms? | Sparsity | Best for |
|---|---|---|---|---|
| One-Hot | No | No | Very high | Baseline experiments |
| BoW | No | No | High | Fast text classification |
| TF-IDF | No | No | High | Search, document similarity |
| N-grams | Local (N words) | No | Explodes with N | Spell check, autocomplete |
| Word2Vec/BERT | Full context | Yes | Dense (low) | Semantic tasks, modern NLP |
Data sparsity problem
N-gram models suffer from data sparsity: most N-gram sequences (especially trigrams and above) never appear in the training corpus, giving them zero probability. Smoothing techniques (Laplace add-1, Kneser-Ney, Good-Turing) redistribute probability mass to unseen N-grams. This is the primary motivation for neural language models.
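Laplace add-1 smoothing can be sketched by extending the bigram idea: add 1 to every count and add the vocabulary size |V| to the denominator, so unseen bigrams receive a small nonzero probability (toy corpus, illustrative only):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)
V = len(vocab)   # |V| = 6 distinct words

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(word, prev):
    """Unsmoothed estimate: zero for any unseen bigram."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_laplace(word, prev):
    """Add-1 smoothing: (C(prev, word) + 1) / (C(prev) + |V|)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# The bigram ('the', 'sat') never occurs in the corpus:
print(p_mle("sat", "the"))                 # 0.0 — zero-probability problem
print(round(p_laplace("sat", "the"), 3))   # (0+1)/(3+6) ≈ 0.111
```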
Practice questions
- A corpus has 3 documents, vocabulary of 500 words. What is the shape of the BoW matrix? (Answer: 3 x 500 — one row per document, one column per vocabulary word.)
- Word "the" appears in every document. What is its IDF score? (Answer: log(N/N) = log(1) = 0. It contributes nothing to TF-IDF — this is the desired behaviour.)
- What is the bigram model approximation for P(cat | the quick brown)? (Answer: P(cat | brown) — bigrams use only the immediately preceding word, ignoring everything before it.)
- Why does BoW fail for sentiment analysis on "not good"? (Answer: BoW ignores word order — "not good" and "good not" produce identical vectors, losing the negation.)
- TF-IDF gives a high score to a word that: (Answer: Appears frequently in THIS document but rarely across all documents. High TF * high IDF = distinctive term.)
On LumiChats
When LumiChats uses semantic search to find relevant context in your documents, it uses dense vector embeddings (the modern successor to TF-IDF) to measure document relevance. Understanding TF-IDF helps you appreciate why keyword search fails for synonyms and why vector search is better.
Try it free