
Comparison

Frequentist vs. Bayesian Inference

Two foundational philosophies of statistical inference: frequentists treat parameters as fixed unknowns and data as random; Bayesians treat parameters as random variables with prior distributions and compute posteriors.

What Each Framework Does

Frequentist and Bayesian inference both aim to learn about unknown parameters $\theta$ from observed data $X$. They differ in what they consider random, what constitutes a valid answer, and how prior information enters the analysis.

Frequentist: The parameter $\theta$ is a fixed (unknown) constant. The data $X$ are random. Inference is about the long-run behavior of estimators and test procedures across hypothetical repeated experiments.

Bayesian: The parameter $\theta$ is a random variable with a prior distribution $p(\theta)$ encoding beliefs before seeing data. After observing data, the prior is updated to a posterior $p(\theta|X)$ via Bayes' theorem. Inference is about the posterior distribution.

Side-by-Side Core Formulas

Definition

Maximum Likelihood Estimation (Frequentist)

The frequentist workhorse is maximum likelihood. Given data $X = (x_1, \ldots, x_n)$ and a model $p(X|\theta)$, the MLE is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n p(x_i|\theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i|\theta)$$

The MLE treats $\theta$ as a fixed parameter to be estimated. Its quality is measured by properties like consistency, efficiency, and bias, defined over hypothetical repeated sampling.
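As a concrete sketch, the MLE for a Bernoulli parameter can be found by maximizing the log-likelihood directly; the grid search below (with illustrative data) recovers the closed-form answer, the sample mean:

```python
# Minimal sketch: MLE for a Bernoulli parameter via the log-likelihood.
import math

def bernoulli_log_likelihood(theta, data):
    """Sum of log p(x_i | theta) for Bernoulli observations."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1]  # illustrative: 6 successes out of 8

# Maximize over a fine grid of candidate theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))

print(theta_mle)  # closed form: mean(data) = 6/8 = 0.75
```

In practice the maximization is done analytically or by gradient methods; the grid is just to make the argmax explicit.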

Definition

Bayesian Posterior (MAP and Full Posterior)

The Bayesian approach applies Bayes' theorem:

$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} \propto p(X|\theta)\,p(\theta)$$

The MAP estimate (Maximum A Posteriori) is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta\; p(X|\theta)\,p(\theta)$$

But Bayesians typically report the full posterior $p(\theta|X)$, not just a point estimate. The posterior encodes all information about $\theta$ given the data and the prior.
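For a concrete illustration, the Beta-Binomial model has a closed-form posterior: with a $\text{Beta}(a, b)$ prior and $k$ successes in $n$ trials, the posterior is $\text{Beta}(a + k, b + n - k)$. The counts and prior pseudo-counts below are illustrative assumptions:

```python
# Sketch: conjugate Beta-Binomial update with a closed-form posterior.
a, b = 2.0, 2.0   # prior pseudo-counts (illustrative assumption)
k, n = 6, 8       # observed successes / trials (illustrative)

a_post, b_post = a + k, b + (n - k)   # posterior is Beta(a_post, b_post)

posterior_mean = a_post / (a_post + b_post)
map_estimate = (a_post - 1) / (a_post + b_post - 2)  # mode, valid for a_post, b_post > 1
mle = k / n  # for comparison: likelihood-only estimate

print(posterior_mean, map_estimate, mle)
```

Note how the prior pulls both the posterior mean and the MAP estimate away from the MLE toward the prior mean of 0.5.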

Where Each Is Stronger

Frequentist wins on objectivity and large-sample theory

Frequentist methods require no prior specification. The MLE depends only on the likelihood, making the analysis independent of subjective beliefs. This is valuable when regulators or reviewers demand pre-specified, assumption-light procedures, or when no credible prior information exists.

The asymptotic theory of MLEs is powerful: under regularity conditions, the MLE is consistent, asymptotically normal, and achieves the Cramér-Rao lower bound.

Bayesian wins on coherence and small-sample inference

Bayesian inference provides a complete probability distribution over parameters, enabling natural answers to questions like "what is the probability that $\theta > 0$?" Frequentist inference cannot answer such questions because $\theta$ is not a random variable in that framework.

Bayesian methods are particularly strong when samples are small, genuine prior information is available, or the model has hierarchical structure that benefits from sharing information across groups.

Key Concepts That Differ

| | Frequentist | Bayesian |
|---|---|---|
| Parameter | Fixed unknown constant | Random variable with a prior |
| Data | Random (from repeated sampling) | Fixed (once observed) |
| Point estimate | MLE $\hat{\theta}_{\text{MLE}}$ | MAP $\hat{\theta}_{\text{MAP}}$ or posterior mean |
| Interval estimate | Confidence interval | Credible interval |
| Interpretation | Long-run frequency properties | Degree of belief |
| Prior | Not used | Required (explicitly specified) |
| Nuisance parameters | Profiled or plugged in | Marginalized (integrated out) |

MLE vs. MAP: The Connection

Proposition

MAP Reduces to Regularized MLE

Statement

If the prior is $p(\theta) \propto \exp(-\lambda\, g(\theta))$ for some penalty function $g$, then:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta\; \left[\sum_{i=1}^n \log p(x_i|\theta) - \lambda\, g(\theta)\right]$$

This is exactly the regularized MLE. A Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$ gives $L_2$ regularization (ridge); a Laplace prior gives $L_1$ regularization (lasso).

Intuition

The MAP estimate bridges the two frameworks. It adds a prior-derived penalty to the log-likelihood. As $n \to \infty$, the likelihood dominates the prior, and MAP converges to MLE. With finite data, the prior acts as regularization, pulling the estimate toward regions of high prior density.
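This connection can be checked numerically in one dimension. For Gaussian data with unit noise variance and a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$ (both variances assumed known for the sketch), the MAP estimate has the closed form $n\bar{x}/(n + 1/\tau^2)$, and maximizing the penalized log-likelihood recovers it:

```python
# Sketch: MAP with a Gaussian prior equals L2-regularized MLE (1-D case).
data = [2.1, 1.9, 2.3, 2.0, 1.7]  # illustrative observations
n = len(data)
xbar = sum(data) / n              # the MLE
tau2 = 1.0                        # prior variance (illustrative assumption)

# Closed-form posterior mode for unit noise variance.
theta_map = n * xbar / (n + 1.0 / tau2)

# Grid check: maximize log-likelihood minus the prior-derived L2 penalty.
def penalized(theta):
    return -0.5 * sum((x - theta) ** 2 for x in data) - theta ** 2 / (2 * tau2)

grid = [i / 10000 for i in range(0, 30000)]
theta_grid = max(grid, key=penalized)

print(xbar, theta_map, theta_grid)  # MAP is pulled toward the prior mean 0
```

With more data the weight $n/(n + 1/\tau^2)$ approaches 1 and the MAP estimate approaches the MLE, matching the intuition above.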

Confidence Intervals vs. Credible Intervals

The most commonly confused distinction:

A 95% confidence interval $[L(X), U(X)]$ is a random interval (because it depends on the data) such that, across repeated experiments, $\Pr(\theta \in [L(X), U(X)]) = 0.95$. It does not mean there is a 95% probability that $\theta$ is in the specific interval you computed.

A 95% credible interval $[a, b]$ satisfies $\int_a^b p(\theta|X)\,d\theta = 0.95$. Given the prior and the data, there is a 95% posterior probability that $\theta \in [a, b]$. This is the interpretation most people incorrectly assign to confidence intervals.

In many common settings (regular models, large samples, diffuse priors), the two intervals are numerically similar. They diverge in small samples, with strong priors, or in models with boundary effects.
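The repeated-sampling meaning of the confidence guarantee can be made tangible by simulation: fix a true $\theta$, generate many datasets, and count how often the interval covers it. A minimal sketch with assumed known noise variance:

```python
# Sketch: coverage of a 95% CI for a Normal mean under repeated sampling.
import math
import random

random.seed(0)
theta_true = 5.0        # fixed unknown in the frequentist view
n, sigma = 25, 1.0      # sample size and known noise s.d. (assumptions)
z = 1.96                # standard normal 97.5% quantile
trials = 2000

covered = 0
for _ in range(trials):
    xbar = sum(random.gauss(theta_true, sigma) for _ in range(n)) / n
    half = z * sigma / math.sqrt(n)
    if xbar - half <= theta_true <= xbar + half:
        covered += 1

coverage = covered / trials
print(coverage)  # close to the nominal 0.95
```

Each individual interval either contains $\theta$ or it does not; the 95% refers only to the long-run proportion across experiments.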

When a Researcher Would Use Each

Example

Clinical trial with regulatory requirements

Use frequentist methods. Regulatory agencies (FDA, EMA) require pre-specified hypothesis tests with controlled Type I error rates. Confidence intervals with guaranteed coverage properties are the standard. The objectivity of frequentist methods is a feature in this context.

Example

Hierarchical model for small-area estimation

Use Bayesian methods. When you have data from many related groups (e.g., disease rates across counties), hierarchical Bayesian models naturally share information across groups through the prior. Partial pooling gives better estimates for small-sample groups than either complete pooling or no pooling.
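The partial-pooling idea can be sketched with the normal-normal hierarchy, where each group mean is shrunk toward the grand mean with a weight that grows with group size. Both variance components are assumed known here for simplicity; real analyses estimate them:

```python
# Hedged sketch of partial pooling: small groups are shrunk more.
groups = {"A": [3.1, 2.9, 3.2, 3.0], "B": [4.5], "C": [2.0, 2.2]}  # illustrative
sigma2 = 1.0   # within-group variance (assumption)
tau2 = 0.5     # between-group variance (assumption)

all_obs = [x for xs in groups.values() for x in xs]
grand_mean = sum(all_obs) / len(all_obs)

pooled = {}
for g, xs in groups.items():
    n = len(xs)
    ybar = sum(xs) / n
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)  # weight on the data; rises with n
    pooled[g] = w * ybar + (1 - w) * grand_mean

print(pooled)  # group B (n = 1) is pulled hardest toward the grand mean
```

Group B, with a single observation, borrows the most strength from the other groups; the large group A barely moves.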

Example

Large-scale neural network training

Use frequentist (MLE via gradient descent). With millions of parameters and large datasets, the data dominate any reasonable prior. Bayesian inference over all neural network weights is computationally intractable except via crude approximations (variational inference, MC dropout). Regularization provides the practical benefits of priors without the computational cost of full Bayesian inference.

Example

A/B testing with prior conversion rate data

Either works. Frequentist sequential testing (with corrections for multiple looks) is standard. But Bayesian A/B testing with an informative prior on the baseline conversion rate can reach conclusions faster and provides direct probability statements that stakeholders find more intuitive.
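A hedged sketch of the Bayesian side of such an A/B test, using Beta posteriors and Monte Carlo to estimate $\Pr(\theta_B > \theta_A)$; the conversion counts and uniform priors below are hypothetical:

```python
# Sketch: Bayesian A/B test via Beta posteriors and Monte Carlo sampling.
import random

random.seed(1)
# Hypothetical (successes, trials) for each variant.
a_succ, a_n = 120, 1000
b_succ, b_n = 145, 1000
prior_a, prior_b = 1.0, 1.0  # uniform Beta(1, 1) prior on each rate

samples = 20000
wins = 0
for _ in range(samples):
    ta = random.betavariate(prior_a + a_succ, prior_b + a_n - a_succ)
    tb = random.betavariate(prior_a + b_succ, prior_b + b_n - b_succ)
    if tb > ta:
        wins += 1

prob_b_better = wins / samples
print(prob_b_better)  # a direct probability statement about the variants
```

The output is exactly the kind of statement ("there is an X% chance B beats A") that a confidence interval cannot deliver.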

Common Confusions

Watch Out

Bayesian does not mean subjective

The prior can be chosen objectively (reference priors, Jeffreys priors, maximum entropy priors). Many Bayesian analyses use weakly informative priors that have minimal impact on the posterior. The prior is a modeling choice, not necessarily a personal belief.

Watch Out

Frequentist does not mean prior-free

Frequentist methods implicitly depend on modeling choices (the likelihood function, the hypothesis class, the test statistic) that play a role similar to the prior. The prior is just more explicit about where assumptions enter. James-Stein estimation, a frequentist procedure, has a Bayesian interpretation and dominates the MLE in high dimensions.
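The James-Stein effect is easy to see in simulation. The sketch below uses the positive-part estimator shrinking toward zero; the dimension and true mean are illustrative choices:

```python
# Sketch: positive-part James-Stein shrinkage beats the MLE in squared error
# for a d-dimensional Normal mean with identity covariance, d >= 3.
import random

random.seed(2)
d = 10
theta = [1.0] * d   # true mean vector (illustrative)
trials = 2000

sse_mle = sse_js = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]     # one observation per coordinate
    norm2 = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (d - 2) / norm2)      # positive-part JS factor
    js = [shrink * v for v in x]
    sse_mle += sum((v - t) ** 2 for v, t in zip(x, theta))
    sse_js += sum((v - t) ** 2 for v, t in zip(js, theta))

print(sse_js < sse_mle)  # JS has lower total squared error with these settings
```

The same estimator falls out of an empirical-Bayes argument with a Gaussian prior, which is why it is often cited as the bridge between the two frameworks.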

Watch Out

The posterior is not the sampling distribution

$p(\theta|X)$ (the Bayesian posterior) and $p(X|\theta)$ viewed as a function of $X$ for fixed $\theta$ (the sampling distribution) are different objects. Confusing them leads to misinterpreting confidence intervals as credible intervals. They are related by Bayes' theorem, but only after specifying a prior.