
ML Methods

Naive Bayes

The simplest generative classifier: assume conditional independence of features given the class, estimate class-conditional densities, and classify via MAP. Why it works despite the wrong independence assumption.


Why This Matters

Naive Bayes is the simplest generative classifier. It teaches two fundamental lessons: (1) Bayes rule turns a generative model p(x \mid y) into a classifier, and (2) strong (even wrong) assumptions can yield good predictions when data is scarce. Naive Bayes remains competitive for text classification and high-dimensional sparse data, and understanding why it works despite its wrong independence assumption is a deep insight about the difference between estimation and prediction.

Mental Model

Instead of directly learning a decision boundary p(y \mid x) (discriminative), naive Bayes models how the data is generated: first nature picks a class y, then features x_1, \ldots, x_D are generated independently given y. The independence assumption is almost always wrong, but it makes estimation trivial (each feature's distribution is estimated separately), and the resulting classifier is often competitive with far more complex models.

Formal Setup

Definition

Naive Bayes Model

The naive Bayes generative model assumes:

p(x, y) = p(y) \prod_{j=1}^{D} p(x_j \mid y)

This factors the joint distribution using the conditional independence assumption: features x_1, \ldots, x_D are independent given the class label y.

Classification uses Bayes rule and the MAP decision rule:

\hat{y} = \arg\max_y \, p(y \mid x) = \arg\max_y \, p(y) \prod_{j=1}^{D} p(x_j \mid y)

The normalizing constant p(x) cancels in the argmax and need not be computed.

Taking logarithms, the decision rule becomes additive:

\hat{y} = \arg\max_y \left[\log p(y) + \sum_{j=1}^{D} \log p(x_j \mid y)\right]

This is a linear classifier in the log-probability features. The decision boundary between any two classes is a hyperplane in log-probability space.
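The additive, log-space decision rule can be sketched directly. The class priors and per-feature probabilities below are made-up illustrative numbers, not taken from any dataset:

```python
import math

# Illustrative (made-up) parameters: equal priors, two binary features.
log_prior = {0: math.log(0.5), 1: math.log(0.5)}
p_one = {0: [0.2, 0.5], 1: [0.9, 0.6]}  # p(x_j = 1 | y); p(x_j = 0 | y) is the complement

def map_classify(x):
    """MAP rule: argmax_y [log p(y) + sum_j log p(x_j | y)]."""
    best_y, best_score = None, -math.inf
    for y in (0, 1):
        score = log_prior[y]
        for x_j, p in zip(x, p_one[y]):
            score += math.log(p if x_j == 1 else 1.0 - p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

Summing logs instead of multiplying probabilities keeps the score from underflowing, which matters once D is large.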

Variants

Definition

Multinomial Naive Bayes

For text classification, represent a document as word counts x = (x_1, \ldots, x_V), where V is the vocabulary size. The class-conditional distribution is multinomial:

p(x \mid y) \propto \prod_{j=1}^{V} \theta_{jy}^{x_j}

where \theta_{jy} = p(\text{word } j \mid \text{class } y). The MLE is the relative frequency \hat{\theta}_{jy} = \frac{n_{jy}}{\sum_k n_{ky}}, where n_{jy} is the count of word j in documents of class y.

Laplace smoothing adds a pseudocount \alpha (typically 1) to prevent zero probabilities:

\hat{\theta}_{jy} = \frac{n_{jy} + \alpha}{\sum_k n_{ky} + V\alpha}
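These estimators translate to a few lines of NumPy. A minimal sketch, with function names of my own choosing; `X` is a document-by-vocabulary count matrix:

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log p(y) and Laplace-smoothed log theta_{jy} from counts.

    X: (n_docs, V) word-count matrix; y: (n_docs,) integer class labels."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.array([X[y == c].sum(axis=0) for c in classes])  # n_{jy}, shape (K, V)
    V = X.shape[1]
    theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + V * alpha)
    return classes, log_prior, np.log(theta)

def predict_multinomial_nb(X, classes, log_prior, log_theta):
    """MAP rule: argmax_y [log p(y) + sum_j x_j log theta_{jy}]."""
    return classes[np.argmax(log_prior + X @ log_theta.T, axis=1)]
```

Smoothing happens entirely inside the parameter estimate; the prediction rule is unchanged.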

Definition

Gaussian Naive Bayes

For continuous features, assume each feature is Gaussian given the class:

p(x_j \mid y = k) = \mathcal{N}(x_j; \mu_{jk}, \sigma_{jk}^2)

Parameters \mu_{jk} and \sigma_{jk}^2 are estimated from the training data in class k. The log-posterior is a quadratic function of x, making the decision boundary quadratic (or linear if all classes share the same variances).
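A minimal sketch of Gaussian naive Bayes under these assumptions (function names are mine; a small variance floor is added, as a practical guard, to avoid division by zero on constant features):

```python
import numpy as np

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Class priors plus per-class, per-feature mean and variance."""
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])
    mu = np.array([X[y == c].mean(axis=0) for c in classes])   # mu_{jk}, shape (K, D)
    var = np.array([X[y == c].var(axis=0) for c in classes]) + var_floor
    return classes, prior, mu, var

def predict_gaussian_nb(X, classes, prior, mu, var):
    """argmax_k [log p(y=k) + sum_j log N(x_j; mu_{jk}, sigma^2_{jk})]."""
    diff = X[:, None, :] - mu[None, :, :]                      # shape (n, K, D)
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff**2 / var[None]).sum(axis=2)
    return classes[np.argmax(np.log(prior)[None] + log_lik, axis=1)]
```

Because each feature gets its own mean and variance per class, fitting is a single pass over the data with no iterative optimization.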

Why Naive Bayes Works Despite Wrong Assumptions

Proposition

Naive Bayes Classification Optimality

Statement

The naive Bayes classifier can achieve the Bayes-optimal decision boundary even when the conditional independence assumption is violated. What matters is not whether p(y \mid x) is estimated correctly, but whether the ordering p(y=1 \mid x) \gtrless p(y=0 \mid x) is correct. The independence assumption can distort the estimated probabilities while preserving the correct ordering.

Intuition

Classification only requires getting the sign of \log p(y=1 \mid x) - \log p(y=0 \mid x) right, not its magnitude. Naive Bayes may estimate p(y \mid x) = 0.99 when the true value is 0.7: the probability is miscalibrated, but the classification is correct. The independence assumption acts like a strong regularizer: it reduces the number of parameters from exponential in D (full joint) to linear in D (separate per-feature marginals), preventing overfitting even with limited data.

Proof Sketch

Consider binary classification with the log-odds:

\log\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \log\frac{p(y=1)}{p(y=0)} + \sum_j \log\frac{p(x_j \mid y=1)}{p(x_j \mid y=0)}

Even though the individual p(x_j \mid y) terms ignore dependencies, the sum can still have the correct sign for most x under the data distribution. The decision boundary (where the log-odds equals zero) can match the Bayes-optimal boundary even when the individual terms are wrong, because errors in different terms can cancel.

Why It Matters

This result illustrates a general principle in ML: a model can be wrong about the data-generating process but still make good predictions. The bias-variance tradeoff explains why: the independence assumption introduces bias (wrong model) but drastically reduces variance (fewer parameters to estimate). For small or high-dimensional datasets, this tradeoff favors naive Bayes.

Failure Mode

When features are strongly correlated within a class and the correlation structure differs between classes, naive Bayes can get the ranking wrong. For example, if two features are perfectly correlated in class 1 but independent in class 0, the independence assumption misestimates the density ratio and can misclassify. Naive Bayes also fails when probability calibration (not just ranking) matters, such as in decision-making under uncertainty.
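This failure can be made concrete with a small constructed example (the distributions below are made up to exhibit the problem): with equal priors, let the two binary features be perfectly correlated in class 1 but independent fair coins in class 0. The marginals are then identical in both classes, so naive Bayes cannot tell the classes apart at all:

```python
from itertools import product

# Constructed example: x2 is a copy of x1 in class 1; independent coins in class 0.
joint_c1 = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
joint_c0 = {x: 0.25 for x in product((0, 1), repeat=2)}

# The marginals are p(x_j = 1 | y) = 0.5 for BOTH classes, so the naive Bayes
# estimate of p(x | y) is 0.5 * 0.5 = 0.25 at every point, for either class.
nb_estimate = {x: 0.5 * 0.5 for x in joint_c0}

for x in sorted(joint_c0):
    true_ratio = joint_c1[x] / joint_c0[x]      # 2.0 at (0,0) and (1,1), else 0.0
    nb_ratio = nb_estimate[x] / nb_estimate[x]  # always 1.0: classes look identical
    print(x, true_ratio, nb_ratio)
```

The true likelihood ratio is 2:1 at the agreeing points and zero at the disagreeing ones, but the naive Bayes ratio is 1 everywhere: the correlation structure that distinguishes the classes is invisible to the marginals.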

Generative vs. Discriminative

Naive Bayes is a generative classifier: it models the full joint p(x, y) and derives p(y \mid x) via Bayes rule. Logistic regression is its discriminative counterpart: it directly models p(y \mid x) without modeling p(x).

Key tradeoffs:

  • Naive Bayes reaches its asymptotic error rate with fewer samples because the independence assumption reduces the number of parameters
  • Logistic regression has a lower asymptotic error because it does not rely on a possibly misspecified model of p(x \mid y)
  • Naive Bayes therefore tends to win at small sample sizes, with logistic regression overtaking it as training data grows

Canonical Examples

Example

Spam classification with multinomial NB

Given emails labeled spam/ham, represent each as a bag of word counts. For class y \in \{\text{spam}, \text{ham}\}, estimate p(y) from class frequencies and p(\text{word}_j \mid y) from word counts with Laplace smoothing. A new email is classified as \hat{y} = \arg\max_y [\log p(y) + \sum_j x_j \log \theta_{jy}]. Despite ignoring word order and word interactions, this achieves greater than 95% accuracy on standard spam benchmarks.

Common Confusions

Watch Out

Naive Bayes probabilities are not calibrated

The posterior probabilities p(y \mid x) from naive Bayes are typically overconfident: they cluster near 0 and 1 even when the true probabilities are moderate. This happens because the independence assumption treats redundant features as independent evidence, compounding the confidence. Use Platt scaling or isotonic regression to calibrate the probabilities if you need them for downstream decisions.
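The double-counting effect fits in a few lines (the probabilities are illustrative): duplicate one informative binary feature. The copy carries no new information, so the true posterior is unchanged, but naive Bayes multiplies the same evidence in twice:

```python
# One informative binary feature with equal class priors (made-up numbers).
p1, p0 = 0.8, 0.3   # p(x1 = 1 | y = 1), p(x1 = 1 | y = 0)

# True posterior after observing x1 = 1; a duplicated x2 = x1 adds nothing new.
true_post = p1 / (p1 + p0)            # ~0.727

# Naive Bayes treats the copy as independent evidence, squaring the likelihoods.
nb_post = p1**2 / (p1**2 + p0**2)     # ~0.877: same ranking, inflated confidence
print(true_post, nb_post)
```

The classification is unchanged (both posteriors exceed 0.5), but the reported probability is pushed toward 1; with many redundant features the posteriors saturate at the extremes.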

Watch Out

The naive assumption is about features given the class, not features in general

Naive Bayes assumes x_j \perp x_k \mid y (conditional independence given the class), not x_j \perp x_k (marginal independence). Features can be highly correlated marginally yet independent within each class. The conditioning on y is critical: it is what makes the assumption less unreasonable than it first appears.

Summary

  • Naive Bayes factors p(x, y) = p(y) \prod_j p(x_j \mid y): independence of features given the class
  • MAP classification: \hat{y} = \arg\max_y [p(y) \prod_j p(x_j \mid y)]
  • The log-posterior is linear in the log-probability features, so naive Bayes is a linear classifier
  • Multinomial NB for text (word counts); Gaussian NB for continuous features
  • Works despite the wrong independence assumption because classification only needs the ranking of p(y \mid x) to be correct
  • Probabilities are miscalibrated (overconfident) due to double-counting correlated evidence

Exercises

ExerciseCore

Problem

You have two classes with equal priors p(y=0) = p(y=1) = 0.5 and two binary features. The class-conditional probabilities are p(x_1=1 \mid y=1) = 0.8, p(x_2=1 \mid y=1) = 0.7, p(x_1=1 \mid y=0) = 0.3, p(x_2=1 \mid y=0) = 0.4. Classify the point x = (1, 1).

ExerciseAdvanced

Problem

Show that for binary classification with binary features, the naive Bayes log-odds \log\frac{p(y=1 \mid x)}{p(y=0 \mid x)} is a linear function of x. What does this say about the decision boundary?

References

Canonical:

  • Mitchell, Machine Learning (1997), Chapter 6
  • Bishop, Pattern Recognition and Machine Learning (2006), Section 4.2

Current:

  • Ng & Jordan, "On Discriminative vs. Generative Classifiers" (2002). The definitive comparison

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28

Next Topics

The natural next steps from naive Bayes:

  • Logistic regression: the discriminative counterpart to naive Bayes
  • Linear regression: the regression analog of linear generative models

Last reviewed: April 2026
