Statistical Estimation
Maximum A Posteriori (MAP) Estimation
Maximum a posteriori estimation as the posterior mode of a Bayesian model: derivation, the flat-prior recovery of MLE, the worked L2-norm-equals-Gaussian-prior and L1-norm-equals-Laplace-prior equivalences that make ridge and lasso Bayesian, and the invariance failure under reparameterization that distinguishes MAP from MLE.
Prerequisites
Common probability distributions, maximum likelihood estimation, convex optimization basics, and Bayesian estimation.
Why This Matters
MAP estimation is the bridge most ML learners cross when they realize that "regularization" is not an engineering trick but a Bayesian model with a non-flat prior. The L2 penalty in ridge regression is exactly a Gaussian prior on the weights, and the L1 penalty in lasso is exactly a Laplace prior. Recognizing this collapses two separate-looking pages (frequentist regularization vs. Bayesian inference) into one Bayesian framework with different prior choices.
Three things this page does that the MLE page and the Bayesian-estimation page on their own do not:
- Derives the L2-equals-Gaussian-prior equivalence in full.
- Derives the L1-equals-Laplace-prior equivalence in full (this derivation does not appear elsewhere on the site).
- Shows where MAP and MLE diverge: invariance under reparameterization, behavior in skewed posteriors, and asymptotic equivalence.
Mental Model
The frequentist MLE is the parameter that makes the observed data look least surprising:

$$\hat\theta_{\text{MLE}} = \arg\max_\theta \log p(\mathcal{D} \mid \theta).$$

The Bayesian posterior combines a prior with the likelihood. The MAP estimator is the posterior mode:

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \left[\, \log p(\mathcal{D} \mid \theta) + \log p(\theta) \,\right].$$

The only difference from MLE is the additive $\log p(\theta)$ term. When the prior is flat ($p(\theta) \propto \text{const}$), MAP and MLE coincide. When the prior is informative, the prior pulls the estimate toward where it puts mass.

Reframing MAP as penalized MLE is the move that exposes ridge and lasso as Bayesian:

$$\hat\theta_{\text{MAP}} = \arg\min_\theta \left[\, -\log p(\mathcal{D} \mid \theta) - \log p(\theta) \,\right].$$

Set $p(\theta) = \mathcal{N}(\theta \mid 0, \tau^2 I)$ and the second term is $\frac{1}{2\tau^2}\|\theta\|_2^2$ plus a constant, the ridge penalty. Set $p(\theta) = \prod_j \mathrm{Laplace}(\theta_j \mid 0, b)$ and it is $\frac{1}{b}\|\theta\|_1$ plus a constant, the lasso penalty. Same MLE machinery, different prior choice.
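To make the penalized-MLE reading concrete, here is a minimal numerical sketch (an assumed one-dimensional example, not part of the derivation above): Gaussian observations with known noise standard deviation $\sigma$, a $\mathcal{N}(0, \tau^2)$ prior on the mean, and a check that minimizing "NLL plus penalty" lands on the closed-form posterior mode. The simulated data and the values of $\sigma$ and $\tau$ are arbitrary.

```python
# Minimal check that the MAP (posterior mode) equals the minimizer of NLL + penalty.
# Model: y_i ~ N(theta, sigma^2) with known sigma; prior theta ~ N(0, tau^2).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma, tau = 1.0, 0.5                      # assumed noise std and prior std
y = rng.normal(1.2, sigma, size=20)        # simulated data

def neg_log_posterior(theta):
    nll = np.sum((y - theta) ** 2) / (2 * sigma**2)   # negative log-likelihood (up to a constant)
    penalty = theta**2 / (2 * tau**2)                  # -log N(0, tau^2) prior (up to a constant)
    return nll + penalty

map_numeric = minimize_scalar(neg_log_posterior).x
map_closed = tau**2 * y.sum() / (len(y) * tau**2 + sigma**2)   # closed-form posterior mode (= mean here)
print(map_numeric, map_closed, y.mean())   # numeric MAP == closed-form MAP; the MLE y.mean() is less shrunk
```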
Formal Setup
Maximum a Posteriori Estimator
Given a likelihood $p(\mathcal{D} \mid \theta)$, a prior $p(\theta)$ on $\theta \in \Theta$, and observed data $\mathcal{D}$, the MAP estimator is

$$\hat\theta_{\text{MAP}} = \arg\max_{\theta \in \Theta} p(\theta \mid \mathcal{D}) = \arg\max_{\theta \in \Theta} \left[\, \log p(\mathcal{D} \mid \theta) + \log p(\theta) \,\right].$$

Equivalently, $\hat\theta_{\text{MAP}}$ is the mode of the posterior distribution $p(\theta \mid \mathcal{D})$. The mode may not exist (improper posteriors, unbounded log-densities), may not be unique (multimodal posteriors), and may sit on the boundary of $\Theta$.
The MAP and the MLE differ by exactly one term: the log-prior. This is the only difference, and every distinguishing property of MAP traces back to it.
MAP as Penalized MLE
Statement
With $\ell(\theta) = \log p(\mathcal{D} \mid \theta)$ the log-likelihood and $R(\theta) = -\log p(\theta)$ the negative log-prior (the "regularizer"),

$$\hat\theta_{\text{MAP}} = \arg\min_\theta \left[\, -\ell(\theta) + R(\theta) \,\right].$$
In particular:
- A flat prior ($R(\theta)$ constant) recovers MLE.
- A Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ gives $R(\theta) = \frac{1}{2\tau^2}\|\theta\|_2^2$ plus a constant: an L2 penalty.
- A Laplace prior $\theta_j \sim \mathrm{Laplace}(0, b)$ gives $R(\theta) = \frac{1}{b}\|\theta\|_1$ plus a constant: an L1 penalty.
Intuition
A prior is a soft constraint on the parameter space, expressed as a penalty in log-space. Maximizing the posterior is the same as minimizing "negative log-likelihood plus penalty." The choice of prior is the choice of penalty.
Proof Sketch
Take logs of the posterior: $\log p(\theta \mid \mathcal{D}) = \log p(\mathcal{D} \mid \theta) + \log p(\theta) - \log p(\mathcal{D})$. The last term does not depend on $\theta$, so $\arg\max_\theta \log p(\theta \mid \mathcal{D}) = \arg\max_\theta [\log p(\mathcal{D} \mid \theta) + \log p(\theta)]$. Substituting $\ell(\theta)$ and $R(\theta) = -\log p(\theta)$ gives the penalized form. For the three special cases: a flat prior gives $R(\theta) = \text{const}$; a Gaussian prior $\mathcal{N}(0, \tau^2 I)$ gives $R(\theta) = \frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}$; a Laplace prior with scale $b$ gives $R(\theta) = \frac{1}{b}\|\theta\|_1 + \text{const}$.
Why It Matters
This single identity unifies regularized estimation across the entire ML toolkit. Ridge, lasso, elastic net, weight decay, dropout (under specific Bayesian interpretations), label smoothing, and entropy regularization are all MAP estimates under various priors. Treating regularization as Bayesian gives one way to choose the regularization strength (via empirical Bayes on the prior variance), to derive shrinkage formulas, and to propagate parameter uncertainty into predictions (which MLE-plus-regularization does not do).
Failure Mode
The MAP identity is purely a re-expression of the posterior mode. It says nothing about whether MAP is a good estimator. In high dimensions or with multimodal posteriors, the mode can be unrepresentative of the posterior; the posterior mean is usually a better summary. MAP also fails to propagate uncertainty: a point estimate is a point, and the posterior covariance information is discarded. Full Bayesian inference (sampling, variational, or conjugate closed-form posteriors) is the answer when uncertainty matters.
Ridge as MAP under a Gaussian Prior
The cleanest demonstration that "L2 penalty = Gaussian prior" is the linear regression case, but the algebra is the same in any model.
Ridge Regression as MAP under a Gaussian Prior
Statement
With Gaussian likelihood $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$, the MAP estimator equals the ridge estimator with regularization strength $\lambda = \sigma^2 / \tau^2$:

$$\hat\beta_{\text{MAP}} = \arg\min_\beta \left[\, \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \,\right] = (X^\top X + \lambda I)^{-1} X^\top y.$$
Intuition
A tight Gaussian prior on $\beta$ (small $\tau^2$) is a strong belief that $\beta$ is near zero, which maps to a heavy ridge penalty (large $\lambda = \sigma^2/\tau^2$). A weak prior (large $\tau^2$) maps to a light penalty, approaching OLS as $\tau^2 \to \infty$.
Proof Sketch
The negative log-posterior in $\beta$, up to a $\beta$-independent constant, is

$$\frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\|\beta\|_2^2.$$

Multiply by $2\sigma^2$ (positive, does not change the arg min):

$$\|y - X\beta\|_2^2 + \frac{\sigma^2}{\tau^2}\|\beta\|_2^2.$$

This is the ridge objective with $\lambda = \sigma^2/\tau^2$. Differentiate with respect to $\beta$ and set to zero: $-2X^\top(y - X\beta) + 2\lambda\beta = 0$, giving $\hat\beta_{\text{MAP}} = (X^\top X + \lambda I)^{-1} X^\top y$.
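A quick numerical check of the equivalence, as a hedged sketch (the design matrix, coefficients, $\sigma$, and $\tau$ below are made-up values): the ridge closed form with $\lambda = \sigma^2/\tau^2$ should coincide with the numerically optimized posterior mode.

```python
# Ridge closed form vs. numerically optimized MAP under a Gaussian prior.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 50, 5
sigma, tau = 1.0, 0.7                      # assumed noise std and prior std
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(0, sigma, size=n)

lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X'X + lam I)^{-1} X'y

def neg_log_posterior(beta):
    return np.sum((y - X @ beta) ** 2) / (2 * sigma**2) + np.sum(beta**2) / (2 * tau**2)

beta_map = minimize(neg_log_posterior, np.zeros(d)).x
print(np.max(np.abs(beta_ridge - beta_map)))   # agrees up to optimizer tolerance
```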
Why It Matters
This identity removes the conceptual gap between "frequentist regularized regression" and "Bayesian Gaussian prior". The ridge solution is the posterior mode under a Gaussian prior. The Bayesian linear regression page goes further and derives the full posterior (mean, covariance, predictive distribution); ridge as MAP is that posterior's mode, which for this Gaussian posterior also happens to equal its mean.
Failure Mode
The equivalence is structural. It does not mean that $\lambda = \sigma^2/\tau^2$ chosen this way produces good predictions, because the prior may be misspecified. Empirical Bayes (estimating $\tau^2$ from the marginal likelihood) or cross-validation (estimating $\lambda$ directly) are the two standard ways to pick the regularization strength. Treating $\lambda$ as a free hyperparameter is fine; pretending it has a single "right" value derived from $\sigma^2$ and a fixed $\tau^2$ is not.
Lasso as MAP under a Laplace Prior
The L1↔Laplace equivalence is structurally identical but uses a different prior density.
Lasso Regression as MAP under a Laplace Prior
Statement
With Gaussian likelihood $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and an independent Laplace prior on each coefficient (density $p(\beta_j) = \frac{1}{2b}\exp(-|\beta_j|/b)$), the MAP estimator equals the lasso estimator with regularization strength $\lambda = 2\sigma^2/b$:

$$\hat\beta_{\text{MAP}} = \arg\min_\beta \left[\, \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \,\right].$$

Equivalently, the lasso objective with penalty $\lambda\|\beta\|_1$ is the MAP under a Laplace prior with scale $b = 2\sigma^2/\lambda$.
Intuition
A Laplace density is sharply peaked at zero, much more peaked than a Gaussian. Its negative log is $|\beta_j|/b$ plus a constant, which is the L1 penalty. A small scale $b$ means the prior strongly favors $\beta_j = 0$; a large $b$ means the prior is diffuse. The Laplace's sharp peak at zero is what makes the resulting MAP estimate exactly sparse: the mode of the posterior literally lies at zero in the coordinates that the data do not pull strongly away from zero. A Gaussian prior, by contrast, has a smooth peak at zero, so the MAP shrinks coefficients toward zero but never sets them exactly to zero. The sparsity of lasso is the sparsity of the posterior mode under a peaked prior.
Proof Sketch
The negative log-posterior in $\beta$, up to a $\beta$-independent constant, is

$$\frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{b}\|\beta\|_1,$$

using $-\log \prod_j \frac{1}{2b} e^{-|\beta_j|/b} = \frac{1}{b}\|\beta\|_1 + \text{const}$ for the product of Laplace densities. Multiply through by $2\sigma^2$ to get $\|y - X\beta\|_2^2 + \frac{2\sigma^2}{b}\|\beta\|_1$; or multiply by $\sigma^2$ to express it as $\frac{1}{2}\|y - X\beta\|_2^2 + \frac{\sigma^2}{b}\|\beta\|_1$, with $\lambda$ rescaled appropriately. The arg min is the lasso solution.
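As an illustrative sketch (the simulated data and the prior scale $b$ are assumed values, and proximal gradient descent is one of several ways to minimize this non-smooth objective), the MAP objective under a Laplace prior can be minimized with ISTA; the soft-thresholding step is exactly what produces coefficients that are identically zero.

```python
# Lasso as Laplace-prior MAP: minimize (1/2)||y - X b||^2 + lam * ||b||_1 with lam = sigma^2 / b_scale,
# which has the same arg min as ||y - X b||^2 / (2 sigma^2) + ||b||_1 / b_scale.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 10
sigma = 1.0
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [3.0, -2.0, 1.5]                 # sparse ground truth
y = X @ beta_true + rng.normal(0, sigma, size=n)

b_scale = 0.05                                   # assumed Laplace prior scale; small scale => strong sparsity
lam = sigma**2 / b_scale

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / Lipschitz constant of the smooth part
beta = np.zeros(d)
for _ in range(5000):                            # ISTA: gradient step on the smooth part, then prox of lam*||.||_1
    beta = soft_threshold(beta - step * (X.T @ (X @ beta - y)), step * lam)

print(np.round(beta, 3))                         # exact zeros in the coordinates the data does not support
```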
Why It Matters
This identity closes the ridge/lasso asymmetry. Both are MAP estimators; they only differ in prior choice. The sparsity property of lasso is the sparsity property of the Laplace prior's posterior mode. The fact that lasso produces exact zeros (and ridge does not) traces directly to the Laplace prior's sharp kink at zero (a non-differentiable peak) versus the Gaussian prior's smooth peak.
Failure Mode
The MAP is the mode of the lasso posterior, not its mean. The Laplace prior gives a posterior whose mode is sparse but whose mean is not (the posterior mean of a coefficient with a continuous posterior is essentially never exactly zero). Full Bayesian inference under the Laplace prior (the Bayesian lasso), or under alternative sparsity priors such as the horseshoe or spike-and-slab, gives credible intervals around each coefficient instead of a single point, but loses the literal-zero point estimate. If you want sparsity, use MAP; if you want uncertainty, use full Bayes. The two answers do not coincide.
Worked Bernoulli/Beta MAP
To see MAP without linear algebra, take a single-parameter coin example. Let $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\theta)$ with $k = \sum_i X_i$ successes. Place the conjugate prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$. The posterior is $\theta \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k)$ (see conjugate priors).
The posterior log-density up to a constant is

$$\log p(\theta \mid \mathcal{D}) = (\alpha + k - 1)\log\theta + (\beta + n - k - 1)\log(1 - \theta) + \text{const}.$$

Differentiate and set to zero: $\frac{\alpha + k - 1}{\theta} - \frac{\beta + n - k - 1}{1 - \theta} = 0$, giving

$$\hat\theta_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}.$$
Compare:
- MLE ($\mathrm{Beta}(1, 1)$ prior, which is flat): $\hat\theta_{\text{MLE}} = k/n$.
- Posterior mean: $\mathbb{E}[\theta \mid \mathcal{D}] = \frac{\alpha + k}{\alpha + \beta + n}$; this is the right summary if you care about squared-error loss.
- MAP: $\hat\theta_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}$; this is the mode.
When $\alpha = \beta = 1$ (flat prior), the MAP simplifies to $k/n$, recovering MLE exactly. When the prior is informative, the prior pseudo-counts $\alpha - 1$ and $\beta - 1$ shift the estimate toward the prior mode $\frac{\alpha - 1}{\alpha + \beta - 2}$.
Numerical case. Coin: prior $\mathrm{Beta}(6, 6)$ (mode at $0.5$; ten pseudo-flips, five pseudo-heads and five pseudo-tails). Observed: 7 heads in 10 flips, so the posterior is $\mathrm{Beta}(13, 9)$. MAP: $12/20 = 0.6$. MLE: $7/10 = 0.7$. Posterior mean: $13/22 \approx 0.591$. The prior pulls the estimate from the MLE 0.7 toward the prior mode 0.5; the posterior mean and the MAP differ slightly because the posterior is mildly skewed.
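The same numbers in a few lines, using the closed-form Beta posterior derived above:

```python
# Beta(6, 6) prior, 7 heads in 10 flips: MAP, MLE, and posterior mean.
alpha0, beta0 = 6, 6                               # prior: mode 0.5, ten pseudo-flips
k, n = 7, 10                                       # observed heads, total flips
alpha_n, beta_n = alpha0 + k, beta0 + (n - k)      # posterior Beta(13, 9)

map_est = (alpha_n - 1) / (alpha_n + beta_n - 2)   # posterior mode
mle = k / n
post_mean = alpha_n / (alpha_n + beta_n)
print(map_est, mle, post_mean)                     # 0.6, 0.7, 0.5909...
```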
When MAP and MLE Diverge: Three Failure Modes
The MAP-vs-MLE difference is the log-prior term. Three places where this term changes the answer qualitatively, not just numerically.
1. Invariance under reparameterization
The MLE is invariant under one-to-one reparameterization: if $\phi = g(\theta)$ for a smooth bijection $g$, then $\hat\phi_{\text{MLE}} = g(\hat\theta_{\text{MLE}})$. The MAP is not: the posterior density in the new parameterization picks up a Jacobian factor, so the mode of the posterior in $\phi$ does not in general equal $g(\hat\theta_{\text{MAP}})$.
Concrete example. $\mathrm{Bernoulli}(\theta)$ with $k$ successes in $n$ trials and a flat prior on $\theta \in (0, 1)$. Reparameterize by the log-odds $\eta = \log\frac{\theta}{1-\theta}$. The flat prior on $\theta$ transforms to $p(\eta) = \sigma(\eta)\,(1 - \sigma(\eta))$ on $\mathbb{R}$; not flat. Maximizing the posterior density of $\theta$ gives $k/n$, but maximizing the posterior density of $\eta$ and mapping back gives $(k+1)/(n+2)$: the same model and data, two parameterizations, two different MAP estimates. This is the standard objection that "MAP depends on parameterization", part of the reason MLE remains the default frequentist estimator, and part of the motivation for the Jeffreys prior (which is constructed to transform consistently across parameterizations; making the MAP itself invariant additionally requires maximizing the posterior relative to an invariant reference measure rather than the Lebesgue density).
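A short numerical illustration of the non-invariance (the counts reuse the coin data from the worked example, now with a flat prior on $\theta$, as an assumed demonstration): maximizing the posterior density over $\theta$ and maximizing it over the log-odds $\eta$ give different estimates once mapped back to $\theta$.

```python
# MAP under a flat prior on theta, computed in theta-coordinates vs. in log-odds coordinates.
import numpy as np
from scipy.optimize import minimize_scalar

k, n = 7, 10

# In theta: flat prior, so maximize the log-likelihood k*log(theta) + (n-k)*log(1-theta).
res_theta = minimize_scalar(lambda th: -(k * np.log(th) + (n - k) * np.log(1 - th)),
                            bounds=(1e-6, 1 - 1e-6), method="bounded")
map_theta = res_theta.x                                   # k/n = 0.7

# In eta = log(theta/(1-theta)): the flat prior on theta induces the density
# sigma(eta)*(1-sigma(eta)) on eta, which adds log(theta) + log(1-theta) to the log-posterior.
def neg_log_post_eta(eta):
    th = 1.0 / (1.0 + np.exp(-eta))
    return -(k * np.log(th) + (n - k) * np.log(1 - th) + np.log(th) + np.log(1 - th))

res_eta = minimize_scalar(neg_log_post_eta, bounds=(-10, 10), method="bounded")
map_eta_back = 1.0 / (1.0 + np.exp(-res_eta.x))           # (k+1)/(n+2) = 0.666...
print(map_theta, map_eta_back)                            # 0.7 vs 0.667: the mode moved
```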
2. Skewed and multimodal posteriors
When the posterior is asymmetric or multimodal, the mode is a poor summary of the distribution's center. For a $\mathrm{Beta}(\alpha, \beta)$ posterior, the mode is $\frac{\alpha - 1}{\alpha + \beta - 2}$ but the mean is $\frac{\alpha}{\alpha + \beta}$, and the gap grows with the skew. For a bimodal posterior, the mode picks whichever peak has slightly higher density, hiding the other entirely.
The decision-theoretic version: MAP minimizes the Bayes risk under 0-1 loss (the loss is one if your estimate misses the truth at all, zero if it hits exactly; for continuous parameters this holds only in the limit of a shrinking tolerance band). For practical loss functions, the posterior mean (squared-error loss) or median (absolute-error loss) is usually a better summary.
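A compact illustration of how the three losses pick different summaries of a skewed posterior; the $\mathrm{Beta}(2, 8)$ density here is an arbitrary example chosen for its skew, not one derived earlier on this page.

```python
# Mode, mean, and median of a right-skewed Beta posterior: the Bayes point estimates
# under 0-1 loss, squared-error loss, and absolute-error loss respectively.
from scipy.stats import beta

a, b = 2, 8
mode = (a - 1) / (a + b - 2)     # 0.125  (MAP-style summary)
mean = a / (a + b)               # 0.2    (posterior mean)
median = beta.median(a, b)       # ~0.18  (posterior median)
print(mode, mean, median)        # the mode sits below both the mean and the median
```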
3. Asymptotic agreement, finite-sample disagreement
As $n \to \infty$, the posterior concentrates around the truth and the prior becomes negligible. By Bernstein-von Mises, the posterior is asymptotically Gaussian centered at the MLE, so $\hat\theta_{\text{MAP}} - \hat\theta_{\text{MLE}} \to 0$ in probability. In finite samples, the disagreement is controlled by the prior strength relative to the data: with $n = 10$ and a prior worth 10 pseudo-samples, the prior is 50% of the information; with $n = 10{,}000$, the prior is about 0.1% of the information and $\hat\theta_{\text{MAP}} \approx \hat\theta_{\text{MLE}}$.
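A sketch of that transition for the Beta-Bernoulli model (the true $\theta = 0.7$, the $\mathrm{Beta}(6, 6)$ prior, and the sample sizes are assumptions for illustration):

```python
# The MAP-MLE gap shrinks as n grows while the prior (worth ~10 pseudo-samples) stays fixed.
import numpy as np

rng = np.random.default_rng(3)
alpha0, beta0 = 6, 6
for n in (20, 200, 20000):
    k = rng.binomial(n, 0.7)
    mle = k / n
    map_est = (alpha0 + k - 1) / (alpha0 + beta0 + n - 2)
    print(n, round(abs(map_est - mle), 4))    # the gap decays roughly like 1/n
```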
Common Confusions
The MAP is not the Bayesian posterior; it is a point summary of it
A common slip: "MAP gives the Bayesian answer." MAP gives a point estimate derived from the Bayesian posterior. The full Bayesian answer is the entire posterior distribution, which carries uncertainty information that the MAP discards. Treating MAP as the Bayesian answer is like treating the OLS coefficient as the frequentist answer without ever computing standard errors.
L2 regularization is not the same as ridge regression in every model
The L2-equals-Gaussian-prior identity holds whenever a parameter carries a $\mathcal{N}(0, \tau^2 I)$ prior, whatever the likelihood. In logistic regression the likelihood is not Gaussian (it is Bernoulli), but L2-penalized logistic regression is still MAP under a Gaussian prior on the coefficients; the algebra is the same. Same trick: the regularizer comes from the prior on the coefficients, not from the likelihood.
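A minimal sketch of that point (simulated data and an assumed prior scale, not an example from the page): the MAP for logistic regression is the minimizer of the Bernoulli negative log-likelihood plus the same Gaussian-prior penalty that appears in ridge.

```python
# L2-penalized logistic regression as MAP under w ~ N(0, tau^2 I).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_true)))

tau = 1.0                                         # assumed prior std; smaller tau => stronger shrinkage

def neg_log_posterior(w):
    z = X @ w
    nll = np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))   # Bernoulli NLL
    return nll + np.sum(w**2) / (2 * tau**2)                                   # + Gaussian-prior penalty

w_map = minimize(neg_log_posterior, np.zeros(d)).x
print(np.round(w_map, 3))                          # shrunk toward 0 relative to the unpenalized fit
```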
Lasso is not a frequentist method that happens to coincide with a Laplace MAP
This is the standard textbook narrative: "L1 regularization is a frequentist sparsity-inducing penalty, and separately its MAP interpretation involves a Laplace prior." But the Bayesian view is the first-principles derivation: choosing a Laplace prior gives lasso. The L1 penalty is a consequence of the prior choice, not an independent construct. The narrative usually goes the other way for historical reasons (Tibshirani's 1996 paper introduced lasso as a constrained-optimization method, not as MAP), but the math works either direction.
MAP does not give credible intervals
Want a 95% credible interval for $\theta$? You need the posterior distribution, not just the mode. MAP is a single number. Some people compute the MAP and the Hessian of the negative log-posterior at the MAP, and use the Laplace approximation $\mathcal{N}(\hat\theta_{\text{MAP}}, H^{-1})$ as a credible-interval approximation; this is fine when the posterior is approximately Gaussian, but it can be very wrong when the posterior is skewed or multimodal. Use full posterior sampling or variational inference for intervals.
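A small sketch of the Laplace-approximation interval for the coin posterior from the worked example, compared against the exact equal-tailed Beta interval (the 95% level is an arbitrary choice):

```python
# Laplace approximation around the MAP for the Beta(13, 9) posterior vs. the exact interval.
import numpy as np
from scipy.stats import beta, norm

a, b = 13, 9                                   # posterior from the Beta(6, 6) prior and 7/10 heads
theta_map = (a - 1) / (a + b - 2)              # 0.6
hess = (a - 1) / theta_map**2 + (b - 1) / (1 - theta_map)**2   # Hessian of -log posterior at the mode
sd_laplace = 1.0 / np.sqrt(hess)

ci_laplace = norm.interval(0.95, loc=theta_map, scale=sd_laplace)
ci_exact = beta.interval(0.95, a, b)           # equal-tailed interval from the true posterior
print(np.round(ci_laplace, 3), np.round(ci_exact, 3))   # close for this mild skew; can be badly off otherwise
```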
Summary
- $\hat\theta_{\text{MAP}} = \arg\max_\theta \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]$; the posterior mode.
- Flat prior recovers MLE exactly.
- Gaussian prior $\mathcal{N}(0, \tau^2 I)$ gives L2 regularization with $\lambda = \sigma^2/\tau^2$ (ridge).
- Laplace prior with scale $b$ gives L1 regularization with $\lambda = 2\sigma^2/b$ (lasso).
- MAP and MLE coincide asymptotically (Bernstein-von Mises) but diverge in finite samples, with the gap controlled by prior strength vs. $n$.
- MAP is not invariant to reparameterization; MLE is. The Jeffreys prior is the usual (partial) workaround.
- MAP is the mode, not the mean, of the posterior. For skewed posteriors, the posterior mean is usually a better point summary.
- MAP gives a point estimate, not uncertainty. Use the full posterior for intervals.
Exercises
Problem
Let $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with known variance $\sigma^2$. Place the prior $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. Derive the MAP estimate of $\mu$ and compare it to the MLE and the posterior mean.
Problem
Derive the MAP estimator for logistic regression with a Gaussian prior on the coefficient vector. Specifically: $y_i \mid x_i, w \sim \mathrm{Bernoulli}(\sigma(w^\top x_i))$ with $\sigma$ the sigmoid, and $w \sim \mathcal{N}(0, \tau^2 I)$. Write the objective function and identify the regularization strength.
Problem
For the Beta-Bernoulli model, derive the MAP estimate and show that it can be written as a weighted average of the prior mode and the MLE. Use this to give an "effective sample size" interpretation of the prior.
Problem
The Jeffreys prior is $p(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta)$ is the Fisher information. Show that for the Bernoulli model, $p(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}$ (a $\mathrm{Beta}(1/2, 1/2)$ density). Investigate whether the resulting MAP is invariant under the log-odds reparameterization $\eta = \log\frac{\theta}{1-\theta}$, being careful about which density you maximize.
References
Canonical:
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §1.2 (Bayesian probability theory, MAP), §3.1 (linear regression and the Gaussian/Laplace prior correspondence).
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. §2.5 (modes and approximations), §4.1 (Laplace approximation).
- Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer. Ch. 4.
- Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation. Springer. Ch. 4.
- Robert, C.P. (2007). The Bayesian Choice. Springer. Ch. 4 (loss functions and Bayes estimators).
Current:
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. §4.5 (MAP estimation), §11.5 (Bayesian linear regression).
- Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." J. Royal Statistical Society B, 58(1):267–288. (The original lasso paper; the Bayesian interpretation is in §4.)
- Park, T., & Casella, G. (2008). "The Bayesian Lasso." Journal of the American Statistical Association, 103(482):681–686. (Full Bayesian treatment of the Laplace prior; recovers lasso as MAP and provides credible intervals.)
Next Topics
- Bayesian linear regression: full posterior derivation, not just the mode.
- Conjugate priors: when MAP and the posterior mean have closed forms.
- Ridge regression and lasso regression: the frequentist faces of the Gaussian and Laplace MAP estimators.
Last reviewed: May 10, 2026