Machine learning models cannot process raw text — they require numerical input. Text representation methods convert documents into fixed-length numerical vectors. Classical approaches include One-Hot Encoding, Bag-of-Words (BoW), TF-IDF, and N-gram models. These methods form the foundation of traditional NLP and are still widely used in production systems for search, spam detection, and text classification where interpretability and speed matter more than deep contextual understanding.
Real-life analogy: The word frequency ledger
Imagine a library cataloguing books by counting how many times each word appears. A book about cooking uses 'flour', 'bake', 'oven' many times. A book about machine learning uses 'gradient', 'loss', 'epoch'. Bag-of-Words is exactly this ledger — a document is just a column of word counts. TF-IDF refines it: common words like 'the' that appear in every book get their counts downweighted because they are not distinctive.
One-Hot Encoding and Bag-of-Words
One-Hot Encoding: Each word in the vocabulary V is represented as a vector of size |V| with a 1 at its index and 0 everywhere else. 'cat' in a vocabulary of 10,000 words → a vector of 9,999 zeros and a single 1. Problem: one-hot vectors cannot capture similarity — 'cat' and 'kitten' are exactly as far apart as 'cat' and 'aeroplane'.
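A minimal sketch of one-hot encoding with a toy three-word vocabulary (the words and indices are purely illustrative). The dot product between any two distinct one-hot vectors is 0, which is precisely the "no similarity signal" problem:

```python
import numpy as np

vocab = ["aeroplane", "cat", "kitten"]              # toy vocabulary, |V| = 3
index = {word: i for i, word in enumerate(vocab)}   # word -> position

def one_hot(word):
    """Return a |V|-dimensional vector with a 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

print(one_hot("cat"))                          # [0 1 0]
# Every pair of distinct words is equally dissimilar:
print(one_hot("cat") @ one_hot("kitten"))      # 0
print(one_hot("cat") @ one_hot("aeroplane"))   # 0
```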
Bag-of-Words (BoW): A document is represented as the sum of its token one-hot vectors — equivalently, a vector of word counts. Word order is completely ignored ('dog bites man' == 'man bites dog' in BoW). Vocabulary size is typically pruned to the top 10,000–50,000 most frequent words.
Bag-of-Words with sklearn CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"the cat sat on the mat",
"the dog sat on the log",
"cats and dogs are friends",
]
# CountVectorizer builds vocab and transforms docs to BoW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
# ['and', 'are', 'cat', 'cats', 'dog', 'dogs', 'friends', 'log', 'mat', 'on', 'sat', 'the']
print("BoW matrix:\n", X.toarray())
# Doc 1: [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 2]
# Doc 2: [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 2]
# Doc 3: [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0]
TF-IDF — weighing importance
TF-IDF (Term Frequency–Inverse Document Frequency) downweights common words that appear in many documents and upweights words that are distinctive to specific documents.
TF = frequency of term t in document d, normalised by total terms. IDF = log of total documents N divided by number of documents containing t. Words appearing in every document get IDF ≈ 0.
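The textbook formulas above can be computed by hand with a toy corpus (the documents here are illustrative). Note that sklearn's TfidfVectorizer uses a smoothed IDF plus L2 normalisation by default, so its numbers differ from this plain-definition version:

```python
import math

# Toy corpus: 'the' appears in every document, 'cat' in only one
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]
N = len(docs)

def tf(term, doc):
    """Term frequency, normalised by document length."""
    return doc.count(term) / len(doc)

def idf(term):
    """log(N / df), where df = number of documents containing the term."""
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

print(idf("the"))                                  # log(3/3) = 0.0
print(round(idf("cat"), 3))                        # log(3/1) ≈ 1.099
print(round(tf("cat", docs[0]) * idf("cat"), 3))   # TF-IDF of 'cat' in doc 0 ≈ 0.366
```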
TF-IDF with sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
"machine learning is great",
"deep learning is a subset of machine learning",
"natural language processing uses machine learning",
"deep learning powers natural language processing",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
# Display as readable DataFrame
df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
print(df.round(3))
# 'is' and 'learning' appear in many docs → lower TF-IDF
# 'natural', 'language', 'processing' more distinctive → higher TF-IDF
N-gram language models
An N-gram is a contiguous sequence of N tokens. Unigrams (N=1) = individual words. Bigrams (N=2) = pairs. Trigrams (N=3) = triples. N-gram language models estimate P(word | previous N-1 words) from corpus frequency counts — they capture local word order that BoW ignores.
N-gram Markov assumption: the probability of the next word depends only on the last N-1 words, estimated as P(w | history) ≈ C(history, w) / C(history), where C(·) is a count in the training corpus. Smoothing (Laplace, Kneser-Ney) handles zero-count N-grams.
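A minimal sketch of an unsmoothed (maximum-likelihood) bigram model estimated from counts on a toy corpus. The corpus and probabilities are illustrative only:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count unigram histories and bigram sequences
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """MLE bigram estimate: P(word | prev) = C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# 'the' is followed by 'cat' twice and 'mat' once:
print(round(p("cat", "the"), 3))   # 2/3 ≈ 0.667
print(round(p("mat", "the"), 3))   # 1/3 ≈ 0.333
```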
| Method | Captures word order? | Handles synonyms? | Sparsity | Best for |
|---|---|---|---|---|
| One-Hot | No | No | Very high | Baseline experiments |
| BoW | No | No | High | Fast text classification |
| TF-IDF | No | No | High | Search, document similarity |
| N-grams | Local (N words) | No | Explodes with N | Spell check, autocomplete |
| Word2Vec/BERT | Full context | Yes | Dense (low) | Semantic tasks, modern NLP |
Data sparsity problem
N-gram models suffer from data sparsity: most N-gram sequences (especially trigrams and above) never appear in the training corpus, giving them zero probability. Smoothing techniques (Laplace add-1, Kneser-Ney, Good-Turing) redistribute probability mass to unseen N-grams. This is the primary motivation for neural language models.
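Laplace add-1 smoothing can be sketched by extending the bigram idea: add 1 to every count and add the vocabulary size |V| to the denominator, so unseen bigrams receive a small nonzero probability (toy corpus, illustrative only):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)
V = len(vocab)   # |V| = 6 distinct words

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(word, prev):
    """Unsmoothed estimate: zero for any unseen bigram."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_laplace(word, prev):
    """Add-1 smoothing: (C(prev, word) + 1) / (C(prev) + |V|)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# The bigram ('the', 'sat') never occurs in the corpus:
print(p_mle("sat", "the"))                 # 0.0 — zero-probability problem
print(round(p_laplace("sat", "the"), 3))   # (0+1)/(3+6) ≈ 0.111
```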
Practice questions
- A corpus has 3 documents, vocabulary of 500 words. What is the shape of the BoW matrix? (Answer: 3 x 500 — one row per document, one column per vocabulary word.)
- Word "the" appears in every document. What is its IDF score? (Answer: log(N/N) = log(1) = 0. It contributes nothing to TF-IDF — this is the desired behaviour.)
- What is the bigram model approximation for P(cat | the quick brown)? (Answer: P(cat | brown) — bigrams use only the immediately preceding word, ignoring everything before it.)
- Why does BoW fail for sentiment analysis on "not good"? (Answer: BoW ignores word order — "not good" and "good not" produce identical vectors, losing the negation.)
- TF-IDF gives a high score to a word that: (Answer: Appears frequently in THIS document but rarely across all documents. High TF * high IDF = distinctive term.)
On LumiChats
When LumiChats uses semantic search to find relevant context in your documents, it uses dense vector embeddings (the modern successor to TF-IDF) to measure document relevance. Understanding TF-IDF helps you appreciate why keyword search fails for synonyms and why vector search is better.
Try it free