
ML Methods

Naive Bayes

The simplest generative classifier: assume conditional independence of features given the class, estimate class-conditional densities, and classify via MAP. Why it works despite the wrong independence assumption.


Why This Matters

Naive Bayes is the simplest generative classifier. It teaches two fundamental lessons: (1) Bayes rule turns a generative model p(x \mid y) into a classifier, and (2) strong (even wrong) assumptions can yield good predictions when data is scarce. Naive Bayes remains competitive for text classification and high-dimensional sparse data, and understanding why it works despite its wrong independence assumption is a deep insight about the difference between estimation and prediction.

Mental Model

Instead of directly learning a decision boundary p(y \mid x) (discriminative), naive Bayes models how the data is generated: first nature picks a class y, then features x_1, \ldots, x_D are generated independently given y. The independence assumption is almost always wrong, but it makes estimation trivial (each feature's distribution is estimated separately), and the resulting classifier is often competitive with far more complex models.

Formal Setup

Definition

Naive Bayes Model

The naive Bayes generative model assumes:

p(x, y) = p(y) \prod_{j=1}^{D} p(x_j \mid y)

This factors the joint distribution using the conditional independence assumption: features x_1, \ldots, x_D are independent given the class label y.

Classification uses Bayes rule and the MAP decision rule:

\hat{y} = \arg\max_y \, p(y \mid x) = \arg\max_y \, p(y) \prod_{j=1}^{D} p(x_j \mid y)

The normalizing constant p(x) cancels in the argmax and need not be computed.

Taking logarithms, the decision rule becomes additive:

\hat{y} = \arg\max_y \left[\log p(y) + \sum_{j=1}^{D} \log p(x_j \mid y)\right]

This is a linear classifier in the log-probability features. The decision boundary between any two classes is a hyperplane in log-probability space.
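The additive, log-space decision rule can be sketched directly. The class priors and per-feature probabilities below are made-up illustrative numbers, not taken from any dataset:

```python
import math

# Illustrative (made-up) parameters: equal priors, two binary features.
log_prior = {0: math.log(0.5), 1: math.log(0.5)}
p_one = {0: [0.2, 0.5], 1: [0.9, 0.6]}  # p(x_j = 1 | y); p(x_j = 0 | y) is the complement

def map_classify(x):
    """MAP rule: argmax_y [log p(y) + sum_j log p(x_j | y)]."""
    best_y, best_score = None, -math.inf
    for y in (0, 1):
        score = log_prior[y]
        for x_j, p in zip(x, p_one[y]):
            score += math.log(p if x_j == 1 else 1.0 - p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

Summing logs instead of multiplying probabilities keeps the score from underflowing, which matters once D is large.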

Variants

Definition

Multinomial Naive Bayes

For text classification, represent a document as word counts x = (x_1, \ldots, x_V), where V is the vocabulary size. The class-conditional distribution is multinomial:

p(x \mid y) \propto \prod_{j=1}^{V} \theta_{jy}^{x_j}

where \theta_{jy} = p(\text{word } j \mid \text{class } y). The MLE is the relative frequency \hat{\theta}_{jy} = \frac{n_{jy}}{\sum_k n_{ky}}, where n_{jy} is the count of word j in documents of class y.

Laplace smoothing adds a pseudocount \alpha (typically 1) to prevent zero probabilities:

\hat{\theta}_{jy} = \frac{n_{jy} + \alpha}{\sum_k n_{ky} + V\alpha}
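These estimators translate to a few lines of NumPy. A minimal sketch, with function names of my own choosing; `X` is a document-by-vocabulary count matrix:

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log p(y) and Laplace-smoothed log theta_{jy} from counts.

    X: (n_docs, V) word-count matrix; y: (n_docs,) integer class labels."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.array([X[y == c].sum(axis=0) for c in classes])  # n_{jy}, shape (K, V)
    V = X.shape[1]
    theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + V * alpha)
    return classes, log_prior, np.log(theta)

def predict_multinomial_nb(X, classes, log_prior, log_theta):
    """MAP rule: argmax_y [log p(y) + sum_j x_j log theta_{jy}]."""
    return classes[np.argmax(log_prior + X @ log_theta.T, axis=1)]
```

Smoothing happens entirely inside the parameter estimate; the prediction rule is unchanged.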

Definition

Gaussian Naive Bayes

For continuous features, assume each feature is Gaussian given the class:

p(x_j \mid y = k) = \mathcal{N}(x_j; \mu_{jk}, \sigma_{jk}^2)

Parameters \mu_{jk} and \sigma_{jk}^2 are estimated from the training data in class k. The log-posterior is a quadratic function of x, making the decision boundary quadratic (or linear if all classes share the same variances).
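A minimal sketch of Gaussian naive Bayes under these assumptions (function names are mine; a small variance floor is added, as a practical guard, to avoid division by zero on constant features):

```python
import numpy as np

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Class priors plus per-class, per-feature mean and variance."""
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])
    mu = np.array([X[y == c].mean(axis=0) for c in classes])   # mu_{jk}, shape (K, D)
    var = np.array([X[y == c].var(axis=0) for c in classes]) + var_floor
    return classes, prior, mu, var

def predict_gaussian_nb(X, classes, prior, mu, var):
    """argmax_k [log p(y=k) + sum_j log N(x_j; mu_{jk}, sigma^2_{jk})]."""
    diff = X[:, None, :] - mu[None, :, :]                      # shape (n, K, D)
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff**2 / var[None]).sum(axis=2)
    return classes[np.argmax(np.log(prior)[None] + log_lik, axis=1)]
```

Because each feature gets its own mean and variance per class, fitting is a single pass over the data with no iterative optimization.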

Why Naive Bayes Works Despite Wrong Assumptions

Proposition

Naive Bayes Classification Optimality

Statement

The naive Bayes classifier can achieve the Bayes-optimal decision boundary even when the conditional independence assumption is violated. What matters is not whether p(y \mid x) is estimated correctly, but whether the ordering p(y=1 \mid x) \gtrless p(y=0 \mid x) is correct. The independence assumption can distort the estimated probabilities while preserving the correct ordering.

Intuition

Classification only requires getting the sign of \log p(y=1 \mid x) - \log p(y=0 \mid x) right, not its magnitude. Naive Bayes may estimate p(y \mid x) = 0.99 when the true value is 0.7: the probability is miscalibrated, but the classification is correct. The independence assumption acts like a strong regularizer: it reduces the number of parameters from exponential in D (full joint) to linear in D (separate per-feature marginals), preventing overfitting even with limited data.

Proof Sketch

Consider binary classification with the log-odds:

\log\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \log\frac{p(y=1)}{p(y=0)} + \sum_j \log\frac{p(x_j \mid y=1)}{p(x_j \mid y=0)}

Even though the individual p(x_j \mid y) terms ignore dependencies, the sum can still have the correct sign for most x under the data distribution. The decision boundary (where the log-odds equals zero) can match the Bayes-optimal boundary even when the individual terms are wrong, because errors in different terms can cancel.

Why It Matters

This result illustrates a general principle in ML: a model can be wrong about the data-generating process but still make good predictions. The bias-variance tradeoff explains why: the independence assumption introduces bias (wrong model) but drastically reduces variance (fewer parameters to estimate). For small or high-dimensional datasets, this tradeoff favors naive Bayes.

Failure Mode

When features are strongly correlated within a class and the correlation structure differs between classes, naive Bayes can get the ranking wrong. For example, if two features are perfectly correlated in class 1 but independent in class 0, the independence assumption misestimates the density ratio and can misclassify. Naive Bayes also fails when probability calibration (not just ranking) matters, such as in decision-making under uncertainty.
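This failure can be made concrete with a small constructed example (the distributions below are made up to exhibit the problem): with equal priors, let the two binary features be perfectly correlated in class 1 but independent fair coins in class 0. The marginals are then identical in both classes, so naive Bayes cannot tell the classes apart at all:

```python
from itertools import product

# Constructed example: x2 is a copy of x1 in class 1; independent coins in class 0.
joint_c1 = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
joint_c0 = {x: 0.25 for x in product((0, 1), repeat=2)}

# The marginals are p(x_j = 1 | y) = 0.5 for BOTH classes, so the naive Bayes
# estimate of p(x | y) is 0.5 * 0.5 = 0.25 at every point, for either class.
nb_estimate = {x: 0.5 * 0.5 for x in joint_c0}

for x in sorted(joint_c0):
    true_ratio = joint_c1[x] / joint_c0[x]      # 2.0 at (0,0) and (1,1), else 0.0
    nb_ratio = nb_estimate[x] / nb_estimate[x]  # always 1.0: classes look identical
    print(x, true_ratio, nb_ratio)
```

The true likelihood ratio is 2:1 at the agreeing points and zero at the disagreeing ones, but the naive Bayes ratio is 1 everywhere: the correlation structure that distinguishes the classes is invisible to the marginals.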

Generative vs. Discriminative

Naive Bayes is a generative classifier: it models the full joint p(x, y) and derives p(y \mid x) via Bayes rule. Logistic regression is its discriminative counterpart: it directly models p(y \mid x) without modeling p(x).

Key tradeoffs:

  • Naive Bayes reaches its asymptotic error rate with fewer samples because the independence assumption reduces the number of parameters
  • Logistic regression has a lower asymptotic error because it does not rely on a possibly misspecified model of p(x \mid y)
  • Naive Bayes therefore tends to win at small sample sizes, with logistic regression overtaking it as training data grows

Canonical Examples

Example

Spam classification with multinomial NB

Given emails labeled spam/ham, represent each as a bag of word counts. For class y \in \{\text{spam}, \text{ham}\}, estimate p(y) from class frequencies and p(\text{word}_j \mid y) from word counts with Laplace smoothing. A new email is classified as \hat{y} = \arg\max_y [\log p(y) + \sum_j x_j \log \theta_{jy}]. Despite ignoring word order and word interactions, this achieves greater than 95% accuracy on standard spam benchmarks.

Common Confusions

Watch Out

Naive Bayes probabilities are not calibrated

The posterior probabilities p(y \mid x) from naive Bayes are typically overconfident: they cluster near 0 and 1 even when the true probabilities are moderate. This happens because the independence assumption treats redundant features as independent evidence, compounding the confidence. Use Platt scaling or isotonic regression to calibrate the probabilities if you need them for downstream decisions.
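The double-counting effect fits in a few lines (the probabilities are illustrative): duplicate one informative binary feature. The copy carries no new information, so the true posterior is unchanged, but naive Bayes multiplies the same evidence in twice:

```python
# One informative binary feature with equal class priors (made-up numbers).
p1, p0 = 0.8, 0.3   # p(x1 = 1 | y = 1), p(x1 = 1 | y = 0)

# True posterior after observing x1 = 1; a duplicated x2 = x1 adds nothing new.
true_post = p1 / (p1 + p0)            # ~0.727

# Naive Bayes treats the copy as independent evidence, squaring the likelihoods.
nb_post = p1**2 / (p1**2 + p0**2)     # ~0.877: same ranking, inflated confidence
print(true_post, nb_post)
```

The classification is unchanged (both posteriors exceed 0.5), but the reported probability is pushed toward 1; with many redundant features the posteriors saturate at the extremes.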

Watch Out

The naive assumption is about features given the class, not features in general

Naive Bayes assumes x_j \perp x_k \mid y (conditional independence given the class), not x_j \perp x_k (marginal independence). Features can be highly correlated marginally yet independent within each class. The conditioning on y is critical: it is what makes the assumption less unreasonable than it first appears.

Summary

  • Naive Bayes factors p(x, y) = p(y) \prod_j p(x_j \mid y): independence of features given the class
  • MAP classification: \hat{y} = \arg\max_y [p(y) \prod_j p(x_j \mid y)]
  • The log-posterior is linear in the log-probability features, so naive Bayes is a linear classifier
  • Multinomial NB for text (word counts); Gaussian NB for continuous features
  • Works despite the wrong independence assumption because classification only needs the ranking of p(y \mid x) to be correct
  • Probabilities are miscalibrated (overconfident) due to double-counting correlated evidence

Exercises

ExerciseCore

Problem

You have two classes with equal priors p(y=0) = p(y=1) = 0.5 and two binary features. The class-conditional probabilities are p(x_1=1 \mid y=1) = 0.8, p(x_2=1 \mid y=1) = 0.7, p(x_1=1 \mid y=0) = 0.3, p(x_2=1 \mid y=0) = 0.4. Classify the point x = (1, 1).

ExerciseAdvanced

Problem

Show that for binary classification with binary features, the naive Bayes log-odds \log\frac{p(y=1 \mid x)}{p(y=0 \mid x)} is a linear function of x. What does this say about the decision boundary?

References

Canonical:

  • Mitchell, Machine Learning (1997), Chapter 6
  • Bishop, Pattern Recognition and Machine Learning (2006), Section 4.2

Current:

  • Ng & Jordan, "On Discriminative vs. Generative Classifiers" (2002). The definitive comparison

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28

Next Topics

The natural next steps from naive Bayes:

  • Logistic regression: the discriminative counterpart to naive Bayes
  • Linear regression: the regression analog of linear generative models

Last reviewed: April 2026
