Statistical Estimation
Maximum A Posteriori (MAP) Estimation
Maximum a posteriori estimation as the posterior mode of a Bayesian model: derivation, the flat-prior recovery of MLE, the worked L2-norm-equals-Gaussian-prior and L1-norm-equals-Laplace-prior equivalences that make ridge and lasso Bayesian, and the invariance failure under reparameterization that distinguishes MAP from MLE.
Prerequisites
Common probability distributions, maximum likelihood estimation, convex optimization basics, and Bayesian estimation.
Why This Matters
MAP estimation is the bridge most ML learners cross when they realize that "regularization" is not an engineering trick but a Bayesian model with a non-flat prior. The L2 penalty in ridge regression is exactly a Gaussian prior on the weights, and the L1 penalty in lasso is exactly a Laplace prior. Recognizing this collapses two separate-looking pages (frequentist regularization vs. Bayesian inference) into one Bayesian framework with different prior choices.
Three things this page does that the MLE page and the Bayesian-estimation page on their own do not:
- Derives the L2-equals-Gaussian-prior equivalence in full.
- Derives the L1-equals-Laplace-prior equivalence in full (this derivation does not appear elsewhere on the site).
- Shows where MAP and MLE diverge: invariance under reparameterization, behavior in skewed posteriors, and asymptotic equivalence.
Mental Model
The frequentist MLE is the parameter that makes the observed data look least surprising:

$$\hat\theta_{\text{MLE}} = \arg\max_\theta \log p(\mathcal{D} \mid \theta).$$

The Bayesian posterior combines a prior with the likelihood. The MAP estimator is the posterior mode:

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \left[\, \log p(\mathcal{D} \mid \theta) + \log p(\theta) \,\right].$$

The only difference from MLE is the additive $\log p(\theta)$ term. When the prior is flat ($p(\theta) \propto \text{const}$), MAP and MLE coincide. When the prior is informative, the prior pulls the estimate toward where it puts mass.

Reframing MAP as penalized MLE is the move that exposes ridge and lasso as Bayesian:

$$\hat\theta_{\text{MAP}} = \arg\min_\theta \left[\, -\log p(\mathcal{D} \mid \theta) - \log p(\theta) \,\right].$$

Set $p(\theta) = \mathcal{N}(\theta \mid 0, \tau^2 I)$ and the second term is $\frac{1}{2\tau^2}\|\theta\|_2^2$ plus a constant, the ridge penalty. Set $p(\theta) = \prod_j \mathrm{Laplace}(\theta_j \mid 0, b)$ and it is $\frac{1}{b}\|\theta\|_1$ plus a constant, the lasso penalty. Same MLE machinery, different prior choice.
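To make the penalized-MLE reading concrete, here is a minimal numerical sketch (an assumed one-dimensional example, not part of the derivation above): Gaussian observations with known noise standard deviation $\sigma$, a $\mathcal{N}(0, \tau^2)$ prior on the mean, and a check that minimizing "NLL plus penalty" lands on the closed-form posterior mode. The simulated data and the values of $\sigma$ and $\tau$ are arbitrary.

```python
# Minimal check that the MAP (posterior mode) equals the minimizer of NLL + penalty.
# Model: y_i ~ N(theta, sigma^2) with known sigma; prior theta ~ N(0, tau^2).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma, tau = 1.0, 0.5                      # assumed noise std and prior std
y = rng.normal(1.2, sigma, size=20)        # simulated data

def neg_log_posterior(theta):
    nll = np.sum((y - theta) ** 2) / (2 * sigma**2)   # negative log-likelihood (up to a constant)
    penalty = theta**2 / (2 * tau**2)                  # -log N(0, tau^2) prior (up to a constant)
    return nll + penalty

map_numeric = minimize_scalar(neg_log_posterior).x
map_closed = tau**2 * y.sum() / (len(y) * tau**2 + sigma**2)   # closed-form posterior mode (= mean here)
print(map_numeric, map_closed, y.mean())   # numeric MAP == closed-form MAP; the MLE y.mean() is less shrunk
```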
Formal Setup
Maximum a Posteriori Estimator
Given a likelihood $p(\mathcal{D} \mid \theta)$, a prior $p(\theta)$ on $\theta \in \Theta$, and observed data $\mathcal{D}$, the MAP estimator is

$$\hat\theta_{\text{MAP}} = \arg\max_{\theta \in \Theta} p(\theta \mid \mathcal{D}) = \arg\max_{\theta \in \Theta} \left[\, \log p(\mathcal{D} \mid \theta) + \log p(\theta) \,\right].$$

Equivalently, $\hat\theta_{\text{MAP}}$ is the mode of the posterior distribution $p(\theta \mid \mathcal{D})$. The mode may not exist (improper posteriors, unbounded log-densities), may not be unique (multimodal posteriors), and may sit on the boundary of $\Theta$.
The MAP and the MLE differ by exactly one term: the log-prior. This is the only difference, and every distinguishing property of MAP traces back to it.
MAP as Penalized MLE
Statement
With $\ell(\theta) = \log p(\mathcal{D} \mid \theta)$ the log-likelihood and $R(\theta) = -\log p(\theta)$ the negative log-prior (the "regularizer"),

$$\hat\theta_{\text{MAP}} = \arg\min_\theta \left[\, -\ell(\theta) + R(\theta) \,\right].$$
In particular:
- A flat prior ($R(\theta)$ constant) recovers MLE.
- A Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ gives $R(\theta) = \frac{1}{2\tau^2}\|\theta\|_2^2$ plus a constant: an L2 penalty.
- A Laplace prior $\theta_j \sim \mathrm{Laplace}(0, b)$ gives $R(\theta) = \frac{1}{b}\|\theta\|_1$ plus a constant: an L1 penalty.
Intuition
A prior is a soft constraint on the parameter space, expressed as a penalty in log-space. Maximizing the posterior is the same as minimizing "negative log-likelihood plus penalty." The choice of prior is the choice of penalty.
Proof Sketch
Take logs of the posterior: $\log p(\theta \mid \mathcal{D}) = \log p(\mathcal{D} \mid \theta) + \log p(\theta) - \log p(\mathcal{D})$. The last term does not depend on $\theta$, so $\arg\max_\theta \log p(\theta \mid \mathcal{D}) = \arg\max_\theta [\log p(\mathcal{D} \mid \theta) + \log p(\theta)]$. Substituting $\ell(\theta)$ and $R(\theta) = -\log p(\theta)$ gives the penalized form. For the three special cases: a flat prior gives $R(\theta) = \text{const}$; a Gaussian prior $\mathcal{N}(0, \tau^2 I)$ gives $R(\theta) = \frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}$; a Laplace prior with scale $b$ gives $R(\theta) = \frac{1}{b}\|\theta\|_1 + \text{const}$.
Why It Matters
This single identity unifies regularized estimation across the entire ML toolkit. Ridge, lasso, elastic net, weight decay, dropout (under specific Bayesian interpretations), label smoothing, and entropy regularization are all MAP estimates under various priors. Treating regularization as Bayesian gives one way to choose the regularization strength (via empirical Bayes on the prior variance), to derive shrinkage formulas, and to propagate parameter uncertainty into predictions (which MLE-plus-regularization does not do).
Failure Mode
The MAP identity is purely a re-expression of the posterior mode. It says nothing about whether MAP is a good estimator. In high dimensions or with multimodal posteriors, the mode can be unrepresentative of the posterior; the posterior mean is usually a better summary. MAP also fails to propagate uncertainty: a point estimate is a point, and the posterior covariance information is discarded. Full Bayesian inference (sampling, variational, or conjugate closed-form posteriors) is the answer when uncertainty matters.
Ridge as MAP under a Gaussian Prior
The cleanest demonstration that "L2 penalty = Gaussian prior" is the linear regression case, but the algebra is the same in any model.
Ridge Regression as MAP under a Gaussian Prior
Statement
With Gaussian likelihood $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$, the MAP estimator equals the ridge estimator with regularization strength $\lambda = \sigma^2 / \tau^2$:

$$\hat\beta_{\text{MAP}} = \arg\min_\beta \left[\, \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \,\right] = (X^\top X + \lambda I)^{-1} X^\top y.$$
Intuition
A tight Gaussian prior on $\beta$ (small $\tau^2$) is a strong belief that $\beta$ is near zero, which maps to a heavy ridge penalty (large $\lambda = \sigma^2/\tau^2$). A weak prior (large $\tau^2$) maps to a light penalty, approaching OLS as $\tau^2 \to \infty$.
Proof Sketch
The negative log-posterior in $\beta$, up to a $\beta$-independent constant, is

$$\frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\|\beta\|_2^2.$$

Multiply by $2\sigma^2$ (positive, does not change the arg min):

$$\|y - X\beta\|_2^2 + \frac{\sigma^2}{\tau^2}\|\beta\|_2^2.$$

This is the ridge objective with $\lambda = \sigma^2/\tau^2$. Differentiate with respect to $\beta$ and set to zero: $-2X^\top(y - X\beta) + 2\lambda\beta = 0$, giving $\hat\beta_{\text{MAP}} = (X^\top X + \lambda I)^{-1} X^\top y$.
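A quick numerical check of the equivalence, as a hedged sketch (the design matrix, coefficients, $\sigma$, and $\tau$ below are made-up values): the ridge closed form with $\lambda = \sigma^2/\tau^2$ should coincide with the numerically optimized posterior mode.

```python
# Ridge closed form vs. numerically optimized MAP under a Gaussian prior.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 50, 5
sigma, tau = 1.0, 0.7                      # assumed noise std and prior std
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(0, sigma, size=n)

lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X'X + lam I)^{-1} X'y

def neg_log_posterior(beta):
    return np.sum((y - X @ beta) ** 2) / (2 * sigma**2) + np.sum(beta**2) / (2 * tau**2)

beta_map = minimize(neg_log_posterior, np.zeros(d)).x
print(np.max(np.abs(beta_ridge - beta_map)))   # agrees up to optimizer tolerance
```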
Why It Matters
This identity removes the conceptual gap between "frequentist regularized regression" and "Bayesian Gaussian prior". The ridge solution is the posterior mode under a Gaussian prior. The Bayesian linear regression page goes further and derives the full posterior (mean, covariance, predictive distribution); ridge as MAP is that posterior's mode, which for this Gaussian posterior also happens to equal its mean.
Failure Mode
The equivalence is structural. It does not mean that $\lambda = \sigma^2/\tau^2$ chosen this way produces good predictions, because the prior may be misspecified. Empirical Bayes (estimating $\tau^2$ from the marginal likelihood) or cross-validation (estimating $\lambda$ directly) are the two standard ways to pick the regularization strength. Treating $\lambda$ as a free hyperparameter is fine; pretending it has a single "right" value derived from $\sigma^2$ and a fixed $\tau^2$ is not.
Lasso as MAP under a Laplace Prior
The L1↔Laplace equivalence is structurally identical but uses a different prior density.
Lasso Regression as MAP under a Laplace Prior
Statement
With Gaussian likelihood $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and an independent Laplace prior on each coefficient (density $p(\beta_j) = \frac{1}{2b}\exp(-|\beta_j|/b)$), the MAP estimator equals the lasso estimator with regularization strength $\lambda = 2\sigma^2/b$:

$$\hat\beta_{\text{MAP}} = \arg\min_\beta \left[\, \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \,\right].$$

Equivalently, the lasso objective with penalty $\lambda\|\beta\|_1$ is the MAP under a Laplace prior with scale $b = 2\sigma^2/\lambda$.
Intuition
A Laplace density is sharply peaked at zero, much more peaked than a Gaussian. Its negative log is $|\beta_j|/b$ plus a constant, which is the L1 penalty. A small scale $b$ means the prior strongly favors $\beta_j = 0$; a large $b$ means the prior is diffuse. The Laplace's sharp peak at zero is what makes the resulting MAP estimate exactly sparse: the mode of the posterior literally lies at zero in the coordinates that the data do not pull strongly away from zero. A Gaussian prior, by contrast, has a smooth peak at zero, so the MAP shrinks coefficients toward zero but never sets them exactly to zero. The sparsity of lasso is the sparsity of the posterior mode under a peaked prior.
Proof Sketch
The negative log-posterior in $\beta$, up to a $\beta$-independent constant, is

$$\frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{b}\|\beta\|_1,$$

using $-\log \prod_j \frac{1}{2b} e^{-|\beta_j|/b} = \frac{1}{b}\|\beta\|_1 + \text{const}$ for the product of Laplace densities. Multiply through by $2\sigma^2$ to get $\|y - X\beta\|_2^2 + \frac{2\sigma^2}{b}\|\beta\|_1$; or multiply by $\sigma^2$ to express it as $\frac{1}{2}\|y - X\beta\|_2^2 + \frac{\sigma^2}{b}\|\beta\|_1$, with $\lambda$ rescaled appropriately. The arg min is the lasso solution.
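As an illustrative sketch (the simulated data and the prior scale $b$ are assumed values, and proximal gradient descent is one of several ways to minimize this non-smooth objective), the MAP objective under a Laplace prior can be minimized with ISTA; the soft-thresholding step is exactly what produces coefficients that are identically zero.

```python
# Lasso as Laplace-prior MAP: minimize (1/2)||y - X b||^2 + lam * ||b||_1 with lam = sigma^2 / b_scale,
# which has the same arg min as ||y - X b||^2 / (2 sigma^2) + ||b||_1 / b_scale.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 10
sigma = 1.0
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [3.0, -2.0, 1.5]                 # sparse ground truth
y = X @ beta_true + rng.normal(0, sigma, size=n)

b_scale = 0.05                                   # assumed Laplace prior scale; small scale => strong sparsity
lam = sigma**2 / b_scale

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / Lipschitz constant of the smooth part
beta = np.zeros(d)
for _ in range(5000):                            # ISTA: gradient step on the smooth part, then prox of lam*||.||_1
    beta = soft_threshold(beta - step * (X.T @ (X @ beta - y)), step * lam)

print(np.round(beta, 3))                         # exact zeros in the coordinates the data does not support
```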
Why It Matters
This identity closes the ridge/lasso asymmetry. Both are MAP estimators; they only differ in prior choice. The sparsity property of lasso is the sparsity property of the Laplace prior's posterior mode. The fact that lasso produces exact zeros (and ridge does not) traces directly to the Laplace prior's sharp kink at zero (a non-differentiable peak) versus the Gaussian prior's smooth peak.
Failure Mode
The MAP is the mode of the lasso posterior, not its mean. The Laplace prior gives a posterior whose mode is sparse but whose mean is not (the posterior mean of a coefficient with a continuous posterior is essentially never exactly zero). Full Bayesian inference under the Laplace prior (the Bayesian lasso), or under alternative sparsity priors such as the horseshoe or spike-and-slab, gives credible intervals around each coefficient instead of a single point, but loses the literal-zero point estimate. If you want sparsity, use MAP; if you want uncertainty, use full Bayes. The two answers do not coincide.
Worked Bernoulli/Beta MAP
To see MAP without linear algebra, take a single-parameter coin example. Let $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\theta)$ with $k = \sum_i X_i$ successes. Place the conjugate prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$. The posterior is $\theta \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k)$ (see conjugate priors).
The posterior log-density up to a constant is

$$\log p(\theta \mid \mathcal{D}) = (\alpha + k - 1)\log\theta + (\beta + n - k - 1)\log(1 - \theta) + \text{const}.$$

Differentiate and set to zero: $\frac{\alpha + k - 1}{\theta} - \frac{\beta + n - k - 1}{1 - \theta} = 0$, giving

$$\hat\theta_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}.$$
Compare:
- MLE ($\mathrm{Beta}(1, 1)$ prior, which is flat): $\hat\theta_{\text{MLE}} = k/n$.
- Posterior mean: $\mathbb{E}[\theta \mid \mathcal{D}] = \frac{\alpha + k}{\alpha + \beta + n}$; this is the right summary if you care about squared-error loss.
- MAP: $\hat\theta_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}$; this is the mode.
When $\alpha = \beta = 1$ (flat prior), the MAP simplifies to $k/n$, recovering MLE exactly. When the prior is informative, the prior pseudo-counts $\alpha - 1$ and $\beta - 1$ shift the estimate toward the prior mode $\frac{\alpha - 1}{\alpha + \beta - 2}$.
Numerical case. Coin: prior $\mathrm{Beta}(6, 6)$ (mode at $0.5$; ten pseudo-flips, five pseudo-heads and five pseudo-tails). Observed: 7 heads in 10 flips, so the posterior is $\mathrm{Beta}(13, 9)$. MAP: $12/20 = 0.6$. MLE: $7/10 = 0.7$. Posterior mean: $13/22 \approx 0.591$. The prior pulls the estimate from the MLE 0.7 toward the prior mode 0.5; the posterior mean and the MAP differ slightly because the posterior is mildly skewed.
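The same numbers in a few lines, using the closed-form Beta posterior derived above:

```python
# Beta(6, 6) prior, 7 heads in 10 flips: MAP, MLE, and posterior mean.
alpha0, beta0 = 6, 6                               # prior: mode 0.5, ten pseudo-flips
k, n = 7, 10                                       # observed heads, total flips
alpha_n, beta_n = alpha0 + k, beta0 + (n - k)      # posterior Beta(13, 9)

map_est = (alpha_n - 1) / (alpha_n + beta_n - 2)   # posterior mode
mle = k / n
post_mean = alpha_n / (alpha_n + beta_n)
print(map_est, mle, post_mean)                     # 0.6, 0.7, 0.5909...
```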
When MAP and MLE Diverge: Three Failure Modes
The MAP-vs-MLE difference is the log-prior term. Three places where this term changes the answer qualitatively, not just numerically.
1. Invariance under reparameterization
The MLE is invariant under one-to-one reparameterization: if $\phi = g(\theta)$ for a smooth bijection $g$, then $\hat\phi_{\text{MLE}} = g(\hat\theta_{\text{MLE}})$. The MAP is not: the posterior density in the new parameterization picks up a Jacobian factor, so the mode of the posterior in $\phi$ does not in general equal $g(\hat\theta_{\text{MAP}})$.
Concrete example. $\mathrm{Bernoulli}(\theta)$ with $k$ successes in $n$ trials and a flat prior on $\theta \in (0, 1)$. Reparameterize by the log-odds $\eta = \log\frac{\theta}{1-\theta}$. The flat prior on $\theta$ transforms to $p(\eta) = \sigma(\eta)\,(1 - \sigma(\eta))$ on $\mathbb{R}$; not flat. Maximizing the posterior density of $\theta$ gives $k/n$, but maximizing the posterior density of $\eta$ and mapping back gives $(k+1)/(n+2)$: the same model and data, two parameterizations, two different MAP estimates. This is the standard objection that "MAP depends on parameterization", part of the reason MLE remains the default frequentist estimator, and part of the motivation for the Jeffreys prior (which is constructed to transform consistently across parameterizations; making the MAP itself invariant additionally requires maximizing the posterior relative to an invariant reference measure rather than the Lebesgue density).
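A short numerical illustration of the non-invariance (the counts reuse the coin data from the worked example, now with a flat prior on $\theta$, as an assumed demonstration): maximizing the posterior density over $\theta$ and maximizing it over the log-odds $\eta$ give different estimates once mapped back to $\theta$.

```python
# MAP under a flat prior on theta, computed in theta-coordinates vs. in log-odds coordinates.
import numpy as np
from scipy.optimize import minimize_scalar

k, n = 7, 10

# In theta: flat prior, so maximize the log-likelihood k*log(theta) + (n-k)*log(1-theta).
res_theta = minimize_scalar(lambda th: -(k * np.log(th) + (n - k) * np.log(1 - th)),
                            bounds=(1e-6, 1 - 1e-6), method="bounded")
map_theta = res_theta.x                                   # k/n = 0.7

# In eta = log(theta/(1-theta)): the flat prior on theta induces the density
# sigma(eta)*(1-sigma(eta)) on eta, which adds log(theta) + log(1-theta) to the log-posterior.
def neg_log_post_eta(eta):
    th = 1.0 / (1.0 + np.exp(-eta))
    return -(k * np.log(th) + (n - k) * np.log(1 - th) + np.log(th) + np.log(1 - th))

res_eta = minimize_scalar(neg_log_post_eta, bounds=(-10, 10), method="bounded")
map_eta_back = 1.0 / (1.0 + np.exp(-res_eta.x))           # (k+1)/(n+2) = 0.666...
print(map_theta, map_eta_back)                            # 0.7 vs 0.667: the mode moved
```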
2. Skewed and multimodal posteriors
When the posterior is asymmetric or multimodal, the mode is a poor summary of the distribution's center. For a $\mathrm{Beta}(\alpha, \beta)$ posterior, the mode is $\frac{\alpha - 1}{\alpha + \beta - 2}$ but the mean is $\frac{\alpha}{\alpha + \beta}$, and the gap grows with the skew. For a bimodal posterior, the mode picks whichever peak has slightly higher density, hiding the other entirely.
The decision-theoretic version: MAP minimizes the Bayes risk under 0-1 loss (the loss is one if your estimate misses the truth at all, zero if it hits exactly; for continuous parameters this holds only in the limit of a shrinking tolerance band). For practical loss functions, the posterior mean (squared-error loss) or median (absolute-error loss) is usually a better summary.
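A compact illustration of how the three losses pick different summaries of a skewed posterior; the $\mathrm{Beta}(2, 8)$ density here is an arbitrary example chosen for its skew, not one derived earlier on this page.

```python
# Mode, mean, and median of a right-skewed Beta posterior: the Bayes point estimates
# under 0-1 loss, squared-error loss, and absolute-error loss respectively.
from scipy.stats import beta

a, b = 2, 8
mode = (a - 1) / (a + b - 2)     # 0.125  (MAP-style summary)
mean = a / (a + b)               # 0.2    (posterior mean)
median = beta.median(a, b)       # ~0.18  (posterior median)
print(mode, mean, median)        # the mode sits below both the mean and the median
```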
3. Asymptotic agreement, finite-sample disagreement
As $n \to \infty$, the posterior concentrates around the truth and the prior becomes negligible. By Bernstein-von Mises, the posterior is asymptotically Gaussian centered at the MLE, so $\hat\theta_{\text{MAP}} - \hat\theta_{\text{MLE}} \to 0$ in probability. In finite samples, the disagreement is controlled by the prior strength relative to the data: with $n = 10$ and a prior worth 10 pseudo-samples, the prior is 50% of the information; with $n = 10{,}000$, the prior is about 0.1% of the information and $\hat\theta_{\text{MAP}} \approx \hat\theta_{\text{MLE}}$.
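A sketch of that transition for the Beta-Bernoulli model (the true $\theta = 0.7$, the $\mathrm{Beta}(6, 6)$ prior, and the sample sizes are assumptions for illustration):

```python
# The MAP-MLE gap shrinks as n grows while the prior (worth ~10 pseudo-samples) stays fixed.
import numpy as np

rng = np.random.default_rng(3)
alpha0, beta0 = 6, 6
for n in (20, 200, 20000):
    k = rng.binomial(n, 0.7)
    mle = k / n
    map_est = (alpha0 + k - 1) / (alpha0 + beta0 + n - 2)
    print(n, round(abs(map_est - mle), 4))    # the gap decays roughly like 1/n
```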
Common Confusions
The MAP is not the Bayesian posterior; it is a point summary of it
A common slip: "MAP gives the Bayesian answer." MAP gives a point estimate derived from the Bayesian posterior. The full Bayesian answer is the entire posterior distribution, which carries uncertainty information that the MAP discards. Treating MAP as the Bayesian answer is like treating the OLS coefficient as the frequentist answer without ever computing standard errors.
L2 regularization is not the same as ridge regression in every model
The L2-equals-Gaussian-prior identity holds whenever a parameter carries a $\mathcal{N}(0, \tau^2 I)$ prior, whatever the likelihood. In logistic regression the likelihood is not Gaussian (it is Bernoulli), but L2-penalized logistic regression is still MAP under a Gaussian prior on the coefficients; the algebra is the same. Same trick: the regularizer comes from the prior on the coefficients, not from the likelihood.
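A minimal sketch of that point (simulated data and an assumed prior scale, not an example from the page): the MAP for logistic regression is the minimizer of the Bernoulli negative log-likelihood plus the same Gaussian-prior penalty that appears in ridge.

```python
# L2-penalized logistic regression as MAP under w ~ N(0, tau^2 I).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_true)))

tau = 1.0                                         # assumed prior std; smaller tau => stronger shrinkage

def neg_log_posterior(w):
    z = X @ w
    nll = np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))   # Bernoulli NLL
    return nll + np.sum(w**2) / (2 * tau**2)                                   # + Gaussian-prior penalty

w_map = minimize(neg_log_posterior, np.zeros(d)).x
print(np.round(w_map, 3))                          # shrunk toward 0 relative to the unpenalized fit
```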
Lasso is not a frequentist method that happens to coincide with a Laplace MAP
This is the standard textbook narrative: "L1 regularization is a frequentist sparsity-inducing penalty, and separately its MAP interpretation involves a Laplace prior." But the Bayesian view is the first-principles derivation: choosing a Laplace prior gives lasso. The L1 penalty is a consequence of the prior choice, not an independent construct. The narrative usually goes the other way for historical reasons (Tibshirani's 1996 paper introduced lasso as a constrained-optimization method, not as MAP), but the math works either direction.
MAP does not give credible intervals
Want a 95% credible interval for $\theta$? You need the posterior distribution, not just the mode. MAP is a single number. Some people compute the MAP and the Hessian of the negative log-posterior at the MAP, and use the Laplace approximation $\mathcal{N}(\hat\theta_{\text{MAP}}, H^{-1})$ as a credible-interval approximation; this is fine when the posterior is approximately Gaussian, but it can be very wrong when the posterior is skewed or multimodal. Use full posterior sampling or variational inference for intervals.
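A small sketch of the Laplace-approximation interval for the coin posterior from the worked example, compared against the exact equal-tailed Beta interval (the 95% level is an arbitrary choice):

```python
# Laplace approximation around the MAP for the Beta(13, 9) posterior vs. the exact interval.
import numpy as np
from scipy.stats import beta, norm

a, b = 13, 9                                   # posterior from the Beta(6, 6) prior and 7/10 heads
theta_map = (a - 1) / (a + b - 2)              # 0.6
hess = (a - 1) / theta_map**2 + (b - 1) / (1 - theta_map)**2   # Hessian of -log posterior at the mode
sd_laplace = 1.0 / np.sqrt(hess)

ci_laplace = norm.interval(0.95, loc=theta_map, scale=sd_laplace)
ci_exact = beta.interval(0.95, a, b)           # equal-tailed interval from the true posterior
print(np.round(ci_laplace, 3), np.round(ci_exact, 3))   # close for this mild skew; can be badly off otherwise
```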
Summary
- $\hat\theta_{\text{MAP}} = \arg\max_\theta \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]$; the posterior mode.
- Flat prior recovers MLE exactly.
- Gaussian prior $\mathcal{N}(0, \tau^2 I)$ gives L2 regularization with $\lambda = \sigma^2/\tau^2$ (ridge).
- Laplace prior with scale $b$ gives L1 regularization with $\lambda = 2\sigma^2/b$ (lasso).
- MAP and MLE coincide asymptotically (Bernstein-von Mises) but diverge in finite samples, with the gap controlled by prior strength vs. $n$.
- MAP is not invariant to reparameterization; MLE is. The Jeffreys prior is the usual (partial) workaround.
- MAP is the mode, not the mean, of the posterior. For skewed posteriors, the posterior mean is usually a better point summary.
- MAP gives a point estimate, not uncertainty. Use the full posterior for intervals.
Exercises
Problem
Let $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with known variance $\sigma^2$. Place the prior $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. Derive the MAP estimate of $\mu$ and compare it to the MLE and the posterior mean.
Problem
Derive the MAP estimator for logistic regression with a Gaussian prior on the coefficient vector. Specifically: $y_i \mid x_i, w \sim \mathrm{Bernoulli}(\sigma(w^\top x_i))$ with $\sigma$ the sigmoid, and $w \sim \mathcal{N}(0, \tau^2 I)$. Write the objective function and identify the regularization strength.
Problem
For the Beta-Bernoulli model, derive the MAP estimate and show that it can be written as a weighted average of the prior mode and the MLE. Use this to give an "effective sample size" interpretation of the prior.
Problem
The Jeffreys prior is $p(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta)$ is the Fisher information. Show that for the Bernoulli model, $p(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}$ (a $\mathrm{Beta}(1/2, 1/2)$ density). Investigate whether the resulting MAP is invariant under the log-odds reparameterization $\eta = \log\frac{\theta}{1-\theta}$, being careful about which density you maximize.
References
Canonical:
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §1.2 (Bayesian probability theory, MAP), §3.1 (linear regression and the Gaussian/Laplace prior correspondence).
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. §2.5 (modes and approximations), §4.1 (Laplace approximation).
- Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer. Ch. 4.
- Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation. Springer. Ch. 4.
- Robert, C.P. (2007). The Bayesian Choice. Springer. Ch. 4 (loss functions and Bayes estimators).
Current:
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. §4.5 (MAP estimation), §11.5 (Bayesian linear regression).
- Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." J. Royal Statistical Society B, 58(1):267–288. (The original lasso paper; the Bayesian interpretation is in §4.)
- Park, T., & Casella, G. (2008). "The Bayesian Lasso." Journal of the American Statistical Association, 103(482):681–686. (Full Bayesian treatment of the Laplace prior; recovers lasso as MAP and provides credible intervals.)
Next Topics
- Bayesian linear regression: full posterior derivation, not just the mode.
- Conjugate priors: when MAP and the posterior mean have closed forms.
- Ridge regression and lasso regression: the frequentist faces of the Gaussian and Laplace MAP estimators.
Last reviewed: May 10, 2026