
Comparison

Frequentist vs. Bayesian Inference

Two foundational philosophies of statistical inference: frequentists treat parameters as fixed unknowns and data as random; Bayesians treat parameters as random variables with prior distributions and compute posteriors.

What Each Framework Does

Frequentist and Bayesian inference both aim to learn about unknown parameters $\theta$ from observed data $X$. They differ in what they consider random, what constitutes a valid answer, and how prior information enters the analysis.

Frequentist: The parameter $\theta$ is a fixed (unknown) constant. The data $X$ are random. Inference is about the long-run behavior of estimators and test procedures across hypothetical repeated experiments.

Bayesian: The parameter $\theta$ is a random variable with a prior distribution $p(\theta)$ encoding beliefs before seeing data. After observing data, the prior is updated to a posterior $p(\theta|X)$ via Bayes' theorem. Inference is about the posterior distribution.

Side-by-Side Core Formulas

Definition

Maximum Likelihood Estimation (Frequentist)

The frequentist workhorse is maximum likelihood. Given data $X = (x_1, \ldots, x_n)$ and a model $p(X|\theta)$, the MLE is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n p(x_i|\theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i|\theta)$$

The MLE treats $\theta$ as a fixed parameter to be estimated. Its quality is measured by properties like consistency, efficiency, and bias, defined over hypothetical repeated sampling.
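As a concrete sketch, the MLE for a Bernoulli parameter can be found by maximizing the log-likelihood directly; the grid search below (with illustrative data) recovers the closed-form answer, the sample mean:

```python
# Minimal sketch: MLE for a Bernoulli parameter via the log-likelihood.
import math

def bernoulli_log_likelihood(theta, data):
    """Sum of log p(x_i | theta) for Bernoulli observations."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1]  # illustrative: 6 successes out of 8

# Maximize over a fine grid of candidate theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))

print(theta_mle)  # closed form: mean(data) = 6/8 = 0.75
```

In practice the maximization is done analytically or by gradient methods; the grid is just to make the argmax explicit.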

Definition

Bayesian Posterior (MAP and Full Posterior)

The Bayesian approach applies Bayes' theorem:

$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} \propto p(X|\theta)\,p(\theta)$$

The MAP estimate (Maximum A Posteriori) is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta\; p(X|\theta)\,p(\theta)$$

But Bayesians typically report the full posterior $p(\theta|X)$, not just a point estimate. The posterior encodes all information about $\theta$ given the data and the prior.
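For a concrete illustration, the Beta-Binomial model has a closed-form posterior: with a $\text{Beta}(a, b)$ prior and $k$ successes in $n$ trials, the posterior is $\text{Beta}(a + k, b + n - k)$. The counts and prior pseudo-counts below are illustrative assumptions:

```python
# Sketch: conjugate Beta-Binomial update with a closed-form posterior.
a, b = 2.0, 2.0   # prior pseudo-counts (illustrative assumption)
k, n = 6, 8       # observed successes / trials (illustrative)

a_post, b_post = a + k, b + (n - k)   # posterior is Beta(a_post, b_post)

posterior_mean = a_post / (a_post + b_post)
map_estimate = (a_post - 1) / (a_post + b_post - 2)  # mode, valid for a_post, b_post > 1
mle = k / n  # for comparison: likelihood-only estimate

print(posterior_mean, map_estimate, mle)
```

Note how the prior pulls both the posterior mean and the MAP estimate away from the MLE toward the prior mean of 0.5.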

Where Each Is Stronger

Frequentist wins on objectivity and large-sample theory

Frequentist methods require no prior specification. The MLE depends only on the likelihood, making the analysis independent of subjective beliefs. This is valuable when regulators or reviewers demand pre-specified, assumption-light procedures, or when no credible prior information exists.

The asymptotic theory of MLEs is powerful: under regularity conditions, the MLE is consistent, asymptotically normal, and achieves the Cramér-Rao lower bound.

Bayesian wins on coherence and small-sample inference

Bayesian inference provides a complete probability distribution over parameters, enabling natural answers to questions like "what is the probability that $\theta > 0$?" Frequentist inference cannot answer such questions because $\theta$ is not a random variable in that framework.

Bayesian methods are particularly strong when samples are small, genuine prior information is available, or the model has hierarchical structure that benefits from sharing information across groups.

Key Concepts That Differ

| | Frequentist | Bayesian |
|---|---|---|
| Parameter | Fixed unknown constant | Random variable with a prior |
| Data | Random (from repeated sampling) | Fixed (once observed) |
| Point estimate | MLE $\hat{\theta}_{\text{MLE}}$ | MAP $\hat{\theta}_{\text{MAP}}$ or posterior mean |
| Interval estimate | Confidence interval | Credible interval |
| Interpretation | Long-run frequency properties | Degree of belief |
| Prior | Not used | Required (explicitly specified) |
| Nuisance parameters | Profiled or plugged in | Marginalized (integrated out) |

MLE vs. MAP: The Connection

Proposition

MAP Reduces to Regularized MLE

Statement

If the prior is $p(\theta) \propto \exp(-\lambda\, g(\theta))$ for some penalty function $g$, then:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta\; \left[\sum_{i=1}^n \log p(x_i|\theta) - \lambda\, g(\theta)\right]$$

This is exactly the regularized MLE. A Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$ gives $L_2$ regularization (ridge); a Laplace prior gives $L_1$ regularization (lasso).

Intuition

The MAP estimate bridges the two frameworks. It adds a prior-derived penalty to the log-likelihood. As $n \to \infty$, the likelihood dominates the prior, and MAP converges to MLE. With finite data, the prior acts as regularization, pulling the estimate toward regions of high prior density.
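This connection can be checked numerically in one dimension. For Gaussian data with unit noise variance and a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$ (both variances assumed known for the sketch), the MAP estimate has the closed form $n\bar{x}/(n + 1/\tau^2)$, and maximizing the penalized log-likelihood recovers it:

```python
# Sketch: MAP with a Gaussian prior equals L2-regularized MLE (1-D case).
data = [2.1, 1.9, 2.3, 2.0, 1.7]  # illustrative observations
n = len(data)
xbar = sum(data) / n              # the MLE
tau2 = 1.0                        # prior variance (illustrative assumption)

# Closed-form posterior mode for unit noise variance.
theta_map = n * xbar / (n + 1.0 / tau2)

# Grid check: maximize log-likelihood minus the prior-derived L2 penalty.
def penalized(theta):
    return -0.5 * sum((x - theta) ** 2 for x in data) - theta ** 2 / (2 * tau2)

grid = [i / 10000 for i in range(0, 30000)]
theta_grid = max(grid, key=penalized)

print(xbar, theta_map, theta_grid)  # MAP is pulled toward the prior mean 0
```

With more data the weight $n/(n + 1/\tau^2)$ approaches 1 and the MAP estimate approaches the MLE, matching the intuition above.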

Confidence Intervals vs. Credible Intervals

The most commonly confused distinction:

A 95% confidence interval $[L(X), U(X)]$ is a random interval (because it depends on the data) such that, across repeated experiments, $\Pr(\theta \in [L(X), U(X)]) = 0.95$. It does not mean there is a 95% probability that $\theta$ is in the specific interval you computed.

A 95% credible interval $[a, b]$ satisfies $\int_a^b p(\theta|X)\,d\theta = 0.95$. Given the prior and the data, there is a 95% posterior probability that $\theta \in [a, b]$. This is the interpretation most people incorrectly assign to confidence intervals.

In many common settings (regular models, large samples, diffuse priors), the two intervals are numerically similar. They diverge in small samples, with strong priors, or in models with boundary effects.
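The repeated-sampling meaning of the confidence guarantee can be made tangible by simulation: fix a true $\theta$, generate many datasets, and count how often the interval covers it. A minimal sketch with assumed known noise variance:

```python
# Sketch: coverage of a 95% CI for a Normal mean under repeated sampling.
import math
import random

random.seed(0)
theta_true = 5.0        # fixed unknown in the frequentist view
n, sigma = 25, 1.0      # sample size and known noise s.d. (assumptions)
z = 1.96                # standard normal 97.5% quantile
trials = 2000

covered = 0
for _ in range(trials):
    xbar = sum(random.gauss(theta_true, sigma) for _ in range(n)) / n
    half = z * sigma / math.sqrt(n)
    if xbar - half <= theta_true <= xbar + half:
        covered += 1

coverage = covered / trials
print(coverage)  # close to the nominal 0.95
```

Each individual interval either contains $\theta$ or it does not; the 95% refers only to the long-run proportion across experiments.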

When a Researcher Would Use Each

Example

Clinical trial with regulatory requirements

Use frequentist methods. Regulatory agencies (FDA, EMA) require pre-specified hypothesis tests with controlled Type I error rates. Confidence intervals with guaranteed coverage properties are the standard. The objectivity of frequentist methods is a feature in this context.

Example

Hierarchical model for small-area estimation

Use Bayesian methods. When you have data from many related groups (e.g., disease rates across counties), hierarchical Bayesian models naturally share information across groups through the prior. Partial pooling gives better estimates for small-sample groups than either complete pooling or no pooling.
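The partial-pooling idea can be sketched with the normal-normal hierarchy, where each group mean is shrunk toward the grand mean with a weight that grows with group size. Both variance components are assumed known here for simplicity; real analyses estimate them:

```python
# Hedged sketch of partial pooling: small groups are shrunk more.
groups = {"A": [3.1, 2.9, 3.2, 3.0], "B": [4.5], "C": [2.0, 2.2]}  # illustrative
sigma2 = 1.0   # within-group variance (assumption)
tau2 = 0.5     # between-group variance (assumption)

all_obs = [x for xs in groups.values() for x in xs]
grand_mean = sum(all_obs) / len(all_obs)

pooled = {}
for g, xs in groups.items():
    n = len(xs)
    ybar = sum(xs) / n
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)  # weight on the data; rises with n
    pooled[g] = w * ybar + (1 - w) * grand_mean

print(pooled)  # group B (n = 1) is pulled hardest toward the grand mean
```

Group B, with a single observation, borrows the most strength from the other groups; the large group A barely moves.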

Example

Large-scale neural network training

Use frequentist (MLE via gradient descent). With millions of parameters and large datasets, the data dominate any reasonable prior. Bayesian inference over all neural network weights is computationally intractable except via crude approximations (variational inference, MC dropout). Regularization provides the practical benefits of priors without the computational cost of full Bayesian inference.

Example

A/B testing with prior conversion rate data

Either works. Frequentist sequential testing (with corrections for multiple looks) is standard. But Bayesian A/B testing with an informative prior on the baseline conversion rate can reach conclusions faster and provides direct probability statements that stakeholders find more intuitive.
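A hedged sketch of the Bayesian side of such an A/B test, using Beta posteriors and Monte Carlo to estimate $\Pr(\theta_B > \theta_A)$; the conversion counts and uniform priors below are hypothetical:

```python
# Sketch: Bayesian A/B test via Beta posteriors and Monte Carlo sampling.
import random

random.seed(1)
# Hypothetical (successes, trials) for each variant.
a_succ, a_n = 120, 1000
b_succ, b_n = 145, 1000
prior_a, prior_b = 1.0, 1.0  # uniform Beta(1, 1) prior on each rate

samples = 20000
wins = 0
for _ in range(samples):
    ta = random.betavariate(prior_a + a_succ, prior_b + a_n - a_succ)
    tb = random.betavariate(prior_a + b_succ, prior_b + b_n - b_succ)
    if tb > ta:
        wins += 1

prob_b_better = wins / samples
print(prob_b_better)  # a direct probability statement about the variants
```

The output is exactly the kind of statement ("there is an X% chance B beats A") that a confidence interval cannot deliver.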

Common Confusions

Watch Out

Bayesian does not mean subjective

The prior can be chosen objectively (reference priors, Jeffreys priors, maximum entropy priors). Many Bayesian analyses use weakly informative priors that have minimal impact on the posterior. The prior is a modeling choice, not necessarily a personal belief.

Watch Out

Frequentist does not mean prior-free

Frequentist methods implicitly depend on modeling choices (the likelihood function, the hypothesis class, the test statistic) that play a role similar to the prior. The prior is just more explicit about where assumptions enter. James-Stein estimation, a frequentist procedure, has a Bayesian interpretation and dominates the MLE in high dimensions.
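The James-Stein effect is easy to see in simulation. The sketch below uses the positive-part estimator shrinking toward zero; the dimension and true mean are illustrative choices:

```python
# Sketch: positive-part James-Stein shrinkage beats the MLE in squared error
# for a d-dimensional Normal mean with identity covariance, d >= 3.
import random

random.seed(2)
d = 10
theta = [1.0] * d   # true mean vector (illustrative)
trials = 2000

sse_mle = sse_js = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]     # one observation per coordinate
    norm2 = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (d - 2) / norm2)      # positive-part JS factor
    js = [shrink * v for v in x]
    sse_mle += sum((v - t) ** 2 for v, t in zip(x, theta))
    sse_js += sum((v - t) ** 2 for v, t in zip(js, theta))

print(sse_js < sse_mle)  # JS has lower total squared error with these settings
```

The same estimator falls out of an empirical-Bayes argument with a Gaussian prior, which is why it is often cited as the bridge between the two frameworks.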

Watch Out

The posterior is not the sampling distribution

$p(\theta|X)$ (the Bayesian posterior) and $p(X|\theta)$ viewed as a function of $X$ for fixed $\theta$ (the sampling distribution) are different objects. Confusing them leads to misinterpreting confidence intervals as credible intervals. They are related by Bayes' theorem, but only after specifying a prior.