Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem with the 'naive' assumption that all features are conditionally independent given the class label. Despite this unrealistic assumption, it works remarkably well in practice, especially for text classification, spam detection, and sentiment analysis. It is extremely fast (training is a single pass over the data; prediction is linear in the number of features), requires minimal data, and outputs probability scores, though these tend to be overconfident and poorly calibrated precisely because of the independence assumption. Three main variants: Gaussian NB (continuous features), Multinomial NB (count data, common in NLP), Bernoulli NB (binary features).
Real-life analogy: The spam filter
To classify 'Win a free iPhone NOW!!!' as spam: count how often 'Win', 'free', 'iPhone', 'NOW' appear in spam vs ham emails. If each word is 10× more common in spam, the combined probability of spam is enormous — even though you ignored correlations between the words (the naive assumption). The independence assumption is wrong (spam words correlate), but the ranking is still correct and classification works well.
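The intuition above can be written out directly. All per-word likelihoods below are made-up toy numbers chosen so each word is 10× likelier in spam; they are not estimates from real data:

```python
# Hand-rolled sketch of the spam-filter intuition: each word's likelihood is
# multiplied in independently -- that is the naive assumption.
p_spam, p_ham = 0.5, 0.5  # class priors

# Assumed toy likelihoods P(word | class); each word is 10x likelier in spam
likelihood = {
    "win":    {"spam": 0.20, "ham": 0.02},
    "free":   {"spam": 0.30, "ham": 0.03},
    "iphone": {"spam": 0.10, "ham": 0.01},
    "now":    {"spam": 0.25, "ham": 0.025},
}

message = ["win", "free", "iphone", "now"]
score_spam, score_ham = p_spam, p_ham
for word in message:
    score_spam *= likelihood[word]["spam"]
    score_ham *= likelihood[word]["ham"]

# Normalise the two scores to get a posterior probability
p_spam_given_msg = score_spam / (score_spam + score_ham)
print(f"P(spam | message) = {p_spam_given_msg:.4f}")  # -> 0.9999
```

Four words at a 10× ratio each give a combined likelihood ratio of 10⁴, so the posterior is ~0.9999 even though word correlations were ignored.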
Bayes theorem and the naive independence assumption
Naive Bayes classification rule. P(y) = class prior. P(xᵢ|y) = likelihood of feature i given class y. The naive assumption: features are conditionally independent given y. Predict: ŷ = argmax_y P(y) × Π P(xᵢ|y). No need to compute the normalisation constant P(x) since we compare across classes.
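In implementations, the product Π P(xᵢ|y) is computed as a sum of logarithms to avoid floating-point underflow when many small probabilities are multiplied. A minimal sketch of the argmax rule with assumed toy likelihoods:

```python
import numpy as np

# argmax over classes of: log P(y) + sum_i log P(x_i | y)
log_prior = {"spam": np.log(0.5), "ham": np.log(0.5)}
log_likelihood = {  # assumed toy values of log P(word | class)
    "spam": {"free": np.log(0.30), "meeting": np.log(0.01)},
    "ham":  {"free": np.log(0.03), "meeting": np.log(0.20)},
}

def predict(words):
    scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in words)
              for c in log_prior}
    return max(scores, key=scores.get)  # no normalisation constant needed

print(predict(["free"]))     # -> spam
print(predict(["meeting"]))  # -> ham
```

Because only the argmax matters, the normalisation constant P(x) never has to be computed, exactly as stated above.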
Gaussian, Multinomial, and Bernoulli Naive Bayes
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.datasets import make_classification, load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
# ── Gaussian NB: for continuous features (assumes Gaussian distribution) ──
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
gnb = GaussianNB(var_smoothing=1e-9) # Small smoothing for numerical stability
gnb.fit(X_train, y_train)
print(f"Gaussian NB accuracy: {gnb.score(X_test, y_test):.3f}")
# Gaussian NB learns: mean and variance of each feature per class
# ── Multinomial NB: for word count features (most common for NLP) ──
texts = ['buy cheap viagra online', 'win free money now',
         'meeting at 3pm tomorrow', 'please review the report']
labels = ['spam', 'spam', 'ham', 'ham']
pipeline_mnb = Pipeline([
    ('vectorizer', CountVectorizer()),        # Word count features
    ('classifier', MultinomialNB(alpha=1.0))  # Laplace smoothing
])
pipeline_mnb.fit(texts, labels)
print("Multinomial NB predictions:",
      pipeline_mnb.predict(['free phone offer', 'team meeting agenda']))
# Alpha (Laplace smoothing): prevents P(word|class)=0 for unseen words
# With alpha=1: P(word|class) = (count + 1) / (total_words_in_class + vocab_size)
# ── Bernoulli NB: for binary features (word present/absent) ──
pipeline_bnb = Pipeline([
    ('vectorizer', CountVectorizer(binary=True)),  # Binary: 1 if word appears, else 0
    ('classifier', BernoulliNB(alpha=1.0))
])
pipeline_bnb.fit(texts, labels)
# ── Complement NB: better than Multinomial for imbalanced classes ──
pipeline_cnb = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', ComplementNB(alpha=0.1))
])
pipeline_cnb.fit(texts, labels)
# Real-world spam classification
spam_texts = ['congratulations you have been selected win 5000 dollars',
              'the project deadline is next friday please review',
              'claim your free gift card limited time offer click now']
for text in spam_texts:
    pred = pipeline_mnb.predict([text])[0]
    prob = pipeline_mnb.predict_proba([text]).max()
    print(f"{pred:4s} ({prob:.2%}): {text[:50]}")
Laplace smoothing and when Naive Bayes works best
The zero-frequency problem: if a word appears in spam but never in ham training data, P(word|ham) = 0. The entire product becomes 0, so the model can never predict ham for any email containing that word. Laplace (additive) smoothing fixes this by adding a pseudo-count α to every word count (α = 1 is classic 'add-one' smoothing), so no word ever has zero probability: P(word|class) = (count(word, class) + α) / (count(class) + α × vocab_size).
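The smoothing formula is simple enough to compute by hand. The word counts below are toy numbers for illustration ('win' is deliberately unseen in ham):

```python
# Laplace smoothing by hand:
# P(word | class) = (count(word, class) + alpha) / (count(class) + alpha * vocab_size)
alpha = 1.0
vocab = ["win", "free", "meeting", "report"]
ham_counts = {"win": 0, "free": 0, "meeting": 5, "report": 3}  # 'win' unseen in ham
total_ham = sum(ham_counts.values())  # 8 words of ham in total

p_unsmoothed = ham_counts["win"] / total_ham
p_smoothed = (ham_counts["win"] + alpha) / (total_ham + alpha * len(vocab))

print(p_unsmoothed)  # 0.0 -- would zero out the whole product
print(p_smoothed)    # 1/12, small but non-zero
```

With smoothing, an unseen word merely contributes a small likelihood instead of vetoing the class outright.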
| NB Variant | Feature type | Assumption | Best use case |
|---|---|---|---|
| Gaussian NB | Continuous | P(xᵢ|y) is Gaussian | Continuous sensor data, medical measurements |
| Multinomial NB | Count (integers) | Multinomial distribution | Text classification, document categorisation |
| Bernoulli NB | Binary (0/1) | Bernoulli distribution | Document presence/absence, boolean features |
| Complement NB | Count | Models complement class | Imbalanced text classification |
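Matching the variant to the feature type (as in the table above) is easy to check empirically. A self-contained benchmark harness on synthetic word-count data; the class word-distributions here are randomly generated, not real corpora:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score

# Synthetic count data: two classes, each with its own multinomial word
# distribution (40 words per 'document', 50-word vocabulary).
rng = np.random.default_rng(42)
n_per_class, vocab_size = 300, 50
p0 = rng.dirichlet(np.ones(vocab_size))  # class-0 word distribution
p1 = rng.dirichlet(np.ones(vocab_size))  # class-1 word distribution
X = np.vstack([rng.multinomial(40, p0, size=n_per_class),
               rng.multinomial(40, p1, size=n_per_class)])
y = np.array([0] * n_per_class + [1] * n_per_class)

results = {}
for model in (MultinomialNB(), GaussianNB()):
    name = model.__class__.__name__
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {results[name]:.3f}")
```

The same harness works for comparing any of the four variants on your own features.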
When Naive Bayes beats more complex models
NB wins when: (1) Training data is small: NB estimates only class priors and per-feature likelihoods, very few parameters, so it can learn usefully even from ~100 examples. (2) Features are approximately conditionally independent given the class (e.g. word occurrences across many distinct topics). (3) Real-time inference is needed: prediction costs O(features) per example. (4) A simple, interpretable probabilistic model is wanted, though NB's raw probabilities tend toward 0 or 1 (overconfident) and usually benefit from recalibration before being used as confidences. NB loses when features are strongly correlated (it double-counts the same evidence), or when decision boundaries are complex and non-linear.
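The 'double-counts evidence' failure mode is easy to demonstrate: duplicating a feature adds no information, but NB multiplies the same likelihood in several times, pushing its probabilities toward 0 and 1. A sketch on synthetic data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_dup = np.hstack([X, X, X])  # every feature repeated 3x: perfectly correlated copies

results = {}
for name, data in [("original", X), ("duplicated", X_dup)]:
    Xtr, Xte, ytr, yte = train_test_split(data, y, random_state=0)
    proba = GaussianNB().fit(Xtr, ytr).predict_proba(Xte)
    results[name] = proba.max(axis=1).mean()  # average predicted confidence
    print(f"{name}: mean confidence = {results[name]:.3f}")
```

Each duplicated copy gets identical fitted parameters, so the per-class likelihood is effectively cubed: the model grows more confident without any new evidence.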
Practice questions
- Why is Naive Bayes called "naive"? (Answer: It makes the naive assumption that all features are conditionally independent given the class label. In reality, features are almost always correlated (spam words "free" and "win" often appear together). Despite this wrong assumption, the classifier often works well in practice.)
- A new email contains the word "pharmaceutical" which never appeared in training spam emails. P("pharmaceutical"|spam)=0. What happens to the spam probability? (Answer: The entire product P(y=spam) × Π P(xᵢ|spam) becomes 0 because multiplying by 0 zeros everything. Laplace smoothing (alpha=1) prevents this by adding 1 to all counts, ensuring no zero probabilities.)
- Gaussian NB vs Multinomial NB — when would you use each? (Answer: Gaussian NB: features are real-valued and approximately normally distributed (height, temperature, sensor readings). Multinomial NB: features are word counts or frequencies (text classification, spam detection, topic modelling). Never mix — using Gaussian NB on word counts gives poor results.)
- Naive Bayes is said to be a "generative" model. What does this mean? (Answer: Generative models learn P(X|y) — the distribution of features given the class. They can generate new synthetic examples. Discriminative models (logistic regression, SVM) learn P(y|X) directly. Generative models work better with small data; discriminative models achieve higher accuracy with large data.)
- Why does Naive Bayes often outperform more complex models on small datasets? (Answer: Complex models (SVM, neural networks) have many parameters and need large datasets to estimate them reliably. NB only needs to estimate class priors and per-feature likelihoods — very few parameters. Less data needed for reliable estimates. Bias-variance tradeoff: NB has high bias but very low variance on small datasets.)
On LumiChats
Naive Bayes is used in LumiChats for fast intent classification and content filtering. Its probabilistic outputs directly give confidence scores — LumiChats uses these to decide when to ask for clarification vs proceed with a response.
Try it free