
Statistical Estimation

Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation as the posterior mode of a Bayesian model: derivation, the flat-prior recovery of MLE, the worked L2-norm-equals-Gaussian-prior and L1-norm-equals-Laplace-prior equivalences that make ridge and lasso Bayesian, and the invariance failure under reparameterization that distinguishes MAP from MLE.


Why This Matters

MAP estimation is the bridge most ML learners cross when they realize that "regularization" is not an engineering trick but a Bayesian model with a non-flat prior. The L2 penalty in ridge regression is exactly a Gaussian prior on the weights, and the L1 penalty in lasso is exactly a Laplace prior. Recognizing this collapses two separate-looking subjects (frequentist regularization vs. Bayesian inference) into one Bayesian framework with different prior choices.

Three things this page does that the MLE page and the Bayesian-estimation page on their own do not:

  1. Derives the L2-equals-Gaussian-prior equivalence in full.
  2. Derives the L1-equals-Laplace-prior equivalence in full (this derivation does not appear elsewhere on the site).
  3. Shows where MAP and MLE diverge: invariance under reparameterization, behavior in skewed posteriors, and asymptotic equivalence.

Mental Model

The frequentist MLE is the parameter that makes the observed data look least surprising:

$$\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \, \log p(D \mid \theta).$$

The Bayesian posterior $\pi(\theta \mid D) \propto p(D \mid \theta)\,\pi(\theta)$ combines a prior $\pi(\theta)$ with the likelihood. The MAP estimator is the posterior mode:

$$\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \, \log \pi(\theta \mid D) = \arg\max_\theta \, \bigl[\log p(D \mid \theta) + \log \pi(\theta)\bigr].$$

The only difference from MLE is the additive $\log \pi(\theta)$ term. When the prior is flat ($\log \pi(\theta) = \text{const}$), MAP and MLE coincide. When the prior is informative, it pulls the estimate toward where it puts mass.

Reframing MAP as penalized MLE is the move that exposes ridge and lasso as Bayesian:

$$\hat\theta_{\mathrm{MAP}} = \arg\min_\theta \, \bigl[-\log p(D \mid \theta) - \log \pi(\theta)\bigr].$$

Set $\pi(\theta) \propto \exp(-\lambda \|\theta\|_2^2 / 2)$ and the second term is $\lambda \|\theta\|_2^2 / 2$, the ridge penalty. Set $\pi(\theta) \propto \exp(-\lambda \|\theta\|_1)$ and it is $\lambda \|\theta\|_1$, the lasso penalty. Same MLE machinery, different prior choice.
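
To make the reframing concrete, here is a minimal numeric sketch (assuming NumPy and SciPy): the same optimizer computes both estimates, and the only change between MLE and MAP is one additive term in the objective. The coin numbers match the worked Beta-Bernoulli example later on this page.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Coin data: k heads in n flips, Beta(5, 5) prior (same numbers as the
# worked example further down the page).
n, k = 10, 7
alpha, beta = 5.0, 5.0

def neg_log_lik(t):
    return -(k * np.log(t) + (n - k) * np.log(1 - t))

def neg_log_prior(t):
    return -((alpha - 1) * np.log(t) + (beta - 1) * np.log(1 - t))

# MLE: minimize the negative log-likelihood alone.
mle = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded").x

# MAP: identical call, one extra additive term from the prior.
map_ = minimize_scalar(lambda t: neg_log_lik(t) + neg_log_prior(t),
                       bounds=(1e-6, 1 - 1e-6), method="bounded").x

print(f"MLE = {mle:.4f}")   # 0.7000 = k/n
print(f"MAP = {map_:.4f}")  # 0.6111 = 11/18, pulled toward the prior mode 0.5
```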

Formal Setup

Definition

Maximum a Posteriori Estimator

Given a likelihood $p(D \mid \theta)$, a prior $\pi(\theta)$, and observed data $D$, the MAP estimator is

$$\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta \in \Theta} \pi(\theta \mid D) = \arg\max_\theta \, \bigl[\log p(D \mid \theta) + \log \pi(\theta)\bigr].$$

Equivalently, $\hat\theta_{\mathrm{MAP}}$ is the mode of the posterior distribution. The mode may not exist (improper posteriors, unbounded log-densities), may not be unique (multimodal posteriors), and may sit on the boundary of $\Theta$.

The MAP and the MLE differ by exactly one term: the log-prior. This is the only difference, and every distinguishing property of MAP traces back to it.

Proposition

MAP as Penalized MLE

Statement

With $\ell_n(\theta) = \log p(D \mid \theta)$ the log-likelihood and $r(\theta) = -\log \pi(\theta)$ the negative log-prior (the "regularizer"),

$$\hat\theta_{\mathrm{MAP}} = \arg\min_\theta \, \bigl[-\ell_n(\theta) + r(\theta)\bigr].$$

In particular:

  • A flat prior ($r$ constant) recovers MLE.
  • A Gaussian prior $\theta \sim \mathcal N(0, \tau^2 I)$ gives $r(\theta) = \frac{1}{2\tau^2}\|\theta\|_2^2$ plus a constant: an L2 penalty.
  • A Laplace prior $\theta_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Lap}(0, b)$ gives $r(\theta) = \frac{1}{b}\|\theta\|_1$ plus a constant: an L1 penalty.

Intuition

A prior is a soft constraint on the parameter space, expressed as a penalty in log-space. Maximizing the posterior is the same as minimizing "negative log-likelihood plus penalty." The choice of prior is the choice of penalty.

Proof Sketch

Take logs of the posterior: $\log \pi(\theta \mid D) = \log p(D \mid \theta) + \log \pi(\theta) - \log p(D)$. The last term does not depend on $\theta$, so $\arg\max_\theta \log \pi(\theta \mid D) = \arg\max_\theta \bigl[\log p(D \mid \theta) + \log \pi(\theta)\bigr] = \arg\min_\theta \bigl[-\ell_n(\theta) - \log \pi(\theta)\bigr]$. Substituting $r(\theta) = -\log \pi(\theta)$ gives the form. For the three special cases: a flat prior gives $r = \text{const}$; a Gaussian prior gives $r(\theta) = \frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}$; a Laplace prior gives $r(\theta) = \frac{1}{b}\|\theta\|_1 + \text{const}$.

Why It Matters

This single identity unifies regularized estimation across the entire ML toolkit. Ridge, lasso, elastic net, weight decay, dropout (under specific Bayesian interpretations), label smoothing, and entropy regularization are all MAP estimates under various priors. Treating regularization as Bayesian gives one way to choose the regularization strength (via empirical Bayes on the prior variance), to derive shrinkage formulas, and to propagate parameter uncertainty into predictions (which MLE-plus-regularization does not do).

Failure Mode

The MAP identity is purely a re-expression of the posterior mode. It says nothing about whether MAP is a good estimator. In high dimensions or with multimodal posteriors, the mode can be unrepresentative of the posterior; the posterior mean is usually a better summary. MAP also fails to propagate uncertainty: a point estimate is a point, and the posterior covariance information is discarded. Full Bayesian inference (sampling, variational, or conjugate closed-form posteriors) is the answer when uncertainty matters.

Ridge as MAP under a Gaussian Prior

The cleanest demonstration that "L2 penalty = Gaussian prior" is the linear regression case, but the algebra is the same in any model.

Theorem

Ridge Regression as MAP under a Gaussian Prior

Statement

With Gaussian likelihood $y \mid X, w \sim \mathcal N(Xw, \sigma^2 I)$ and Gaussian prior $w \sim \mathcal N(0, \tau^2 I)$, the MAP estimator equals the ridge estimator with regularization strength $\lambda = \sigma^2/\tau^2$:

$$\hat w_{\mathrm{MAP}} = \arg\min_w \, \tfrac{1}{2\sigma^2}\|y - Xw\|_2^2 + \tfrac{1}{2\tau^2}\|w\|_2^2 = \bigl(X^\top X + \tfrac{\sigma^2}{\tau^2} I\bigr)^{-1} X^\top y.$$

Intuition

A tight Gaussian prior on $w$ (small $\tau^2$) is a strong belief that $w$ is near zero, which maps to a heavy ridge penalty (large $\lambda$). A weak prior (large $\tau^2$) maps to a light penalty, approaching OLS as $\tau^2 \to \infty$.

Proof Sketch

The negative log-posterior in $w$, up to a $w$-independent constant, is

$$-\log p(y \mid X, w) - \log \pi(w) = \tfrac{1}{2\sigma^2}\|y - Xw\|^2 + \tfrac{1}{2\tau^2}\|w\|^2 + \text{const}.$$

Multiply by $2\sigma^2$ (positive, so the arg min is unchanged):

$$\|y - Xw\|^2 + \tfrac{\sigma^2}{\tau^2}\|w\|^2.$$

This is the ridge objective with $\lambda = \sigma^2/\tau^2$. Differentiate with respect to $w$ and set to zero: $-2X^\top(y - Xw) + 2\tfrac{\sigma^2}{\tau^2}w = 0$, giving $\hat w = \bigl(X^\top X + \tfrac{\sigma^2}{\tau^2} I\bigr)^{-1} X^\top y$.
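
The identity is easy to check numerically. A sketch (assuming NumPy and SciPy; the data, $\sigma^2$, and $\tau^2$ are illustrative choices): the closed-form ridge solution and a direct minimization of the negative log-posterior land on the same vector.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 50, 3
sigma2, tau2 = 1.0, 0.5                    # noise variance, prior variance
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Closed-form MAP / ridge solution with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Direct numeric minimization of the negative log-posterior.
def neg_log_post(w):
    return np.sum((y - X @ w) ** 2) / (2 * sigma2) + np.sum(w**2) / (2 * tau2)

w_opt = minimize(neg_log_post, np.zeros(d)).x
print(np.allclose(w_ridge, w_opt, atol=1e-4))  # True: same minimizer
```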

Why It Matters

This identity removes the conceptual gap between "frequentist regularized regression" and "Bayesian Gaussian prior". The ridge solution is the posterior mode under a Gaussian prior; since that posterior is itself Gaussian, the mode coincides with the posterior mean. The Bayesian linear regression page goes further and derives the full posterior (mean, covariance, predictive distribution); ridge as MAP recovers only the center of that posterior.

Failure Mode

The equivalence is structural. It does not mean that $\lambda = \sigma^2/\tau^2$ chosen this way produces good predictions, because the prior $\tau^2$ may be misspecified. Empirical Bayes (estimating $\tau^2$ from the marginal likelihood) and cross-validation (estimating $\lambda$ directly) are the two standard ways to pick the regularization strength. Treating $\lambda$ as a free hyperparameter is fine; pretending it has a single "right" value derived from $\sigma$ and a fixed $\tau$ is not.

Lasso as MAP under a Laplace Prior

The L1↔Laplace equivalence is structurally identical but uses a different prior density.

Theorem

Lasso Regression as MAP under a Laplace Prior

Statement

With Gaussian likelihood $y \mid X, w \sim \mathcal N(Xw, \sigma^2 I)$ and Laplace prior on each coefficient $w_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Lap}(0, b)$ (density $\tfrac{1}{2b}\exp(-|w_j|/b)$), the MAP estimator equals the lasso estimator with regularization strength $\lambda = \sigma^2/b$:

$$\hat w_{\mathrm{MAP}} = \arg\min_w \, \tfrac{1}{2\sigma^2}\|y - Xw\|_2^2 + \tfrac{1}{b}\|w\|_1.$$

Equivalently, the lasso objective with penalty $\lambda\|w\|_1$ is the MAP under a Laplace prior with scale $b = \sigma^2/\lambda$.

Intuition

A Laplace density $\tfrac{1}{2b}\exp(-|w|/b)$ is sharply peaked at zero, much more so than a Gaussian. Its log is $-\tfrac{|w|}{b} - \log(2b)$, which is the L1 penalty (up to constants). A small scale $b$ means the prior strongly favors $w = 0$; a large $b$ means the prior is diffuse. The Laplace's sharp peak at zero is what makes the resulting MAP estimate exactly sparse: the posterior mode sits at exactly zero in every coordinate the data does not pull strongly away from zero. A Gaussian prior, by contrast, has a smooth peak at zero, so the MAP shrinks coefficients toward zero but never sets them exactly to zero. The sparsity of lasso is the sparsity of the posterior mode under a peaked prior.

Proof Sketch

The negative log-posterior in $w$, up to a $w$-independent constant, is

$$-\log p(y \mid X, w) - \log \pi(w) = \tfrac{1}{2\sigma^2}\|y - Xw\|^2 + \tfrac{1}{b}\|w\|_1 + \text{const},$$

using $\log \pi(w) = -\tfrac{1}{b}\|w\|_1 + \text{const}$ for the product of Laplace densities. Multiply through by $2\sigma^2$ to get $\|y - Xw\|^2 + \tfrac{2\sigma^2}{b}\|w\|_1$, or rescale to the conventional form $\tfrac{1}{2n}\|y - Xw\|^2 + \lambda\|w\|_1$ with $\lambda$ adjusted accordingly. The arg min is the lasso solution.
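
A numeric sketch of the equivalence (assuming NumPy and scikit-learn; the data, $\sigma^2$, and $b$ are illustrative). Proximal gradient descent (ISTA) on the MAP objective and scikit-learn's Lasso agree once the penalty scales are matched: sklearn minimizes $\tfrac{1}{2n}\|y - Xw\|^2 + \alpha\|w\|_1$, so $\alpha = \sigma^2/(nb)$.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d = 100, 5
sigma2, b = 1.0, 0.05                            # noise variance, Laplace scale
X = rng.normal(size=(n, d))
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])    # sparse ground truth
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# ISTA (proximal gradient) on the MAP objective
#   (1/(2*sigma2)) * ||y - Xw||^2 + (1/b) * ||w||_1.
step = sigma2 / np.linalg.eigvalsh(X.T @ X).max()   # 1 / Lipschitz constant
w = np.zeros(d)
for _ in range(5000):
    grad = -X.T @ (y - X @ w) / sigma2              # gradient of the smooth part
    z = w - step * grad
    w = np.sign(z) * np.maximum(np.abs(z) - step / b, 0.0)  # soft-threshold prox

# Same solution from sklearn with the matched penalty scale.
skl = Lasso(alpha=sigma2 / (n * b), fit_intercept=False).fit(X, y)
print(np.allclose(w, skl.coef_, atol=1e-3))  # True
print(w.round(3))                            # exact zeros where the prior wins
```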

Why It Matters

This identity closes the ridge/lasso asymmetry. Both are MAP estimators; they only differ in prior choice. The sparsity property of lasso is the sparsity property of the Laplace prior's posterior mode. The fact that lasso produces exact zeros (and ridge does not) traces directly to the Laplace prior's sharp kink at zero (a non-differentiable peak) versus the Gaussian prior's smooth peak.

Failure Mode

The MAP is the mode of the lasso posterior, not its mean. The Laplace prior gives a posterior whose mode is sparse but whose mean is not: each coordinate's posterior is a continuous distribution, so its mean is generically nonzero. Full Bayesian inference under the Laplace prior (the Bayesian lasso), or under sparsity priors like the horseshoe or spike-and-slab, gives credible intervals around each coefficient instead of a single point, but loses the literal-zero point estimate. If you want sparsity, use MAP; if you want uncertainty, use full Bayes. The two answers do not coincide.

Worked Bernoulli/Beta MAP

To see MAP without linear algebra, take a single-parameter coin example. Let $x_1, \dots, x_n \stackrel{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(\theta)$ with $k = \sum_i x_i$ successes. Place the conjugate prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$. The posterior is $\mathrm{Beta}(\alpha + k,\, \beta + n - k)$ (see conjugate priors).

The posterior log-density up to a constant is

$$\log \pi(\theta \mid x) = (\alpha + k - 1)\log\theta + (\beta + n - k - 1)\log(1 - \theta) + \text{const}.$$

Differentiate and set to zero: $\frac{\alpha + k - 1}{\theta} - \frac{\beta + n - k - 1}{1 - \theta} = 0$, giving

$$\hat\theta_{\mathrm{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}.$$

Compare:

  • MLE (equivalently, MAP under the flat $\mathrm{Beta}(1, 1)$ prior): $\hat\theta_{\mathrm{MLE}} = k/n$.
  • Posterior mean: $\mathbb E[\theta \mid x] = (\alpha + k)/(\alpha + \beta + n)$; the right summary under squared-error loss.
  • MAP: $(\alpha + k - 1)/(\alpha + \beta + n - 2)$; the mode.

When $\alpha = \beta = 1$ (flat prior), the MAP simplifies to $k/n$, recovering the MLE exactly. When the prior is informative, the prior pseudo-counts $\alpha - 1$ and $\beta - 1$ shift the estimate toward the prior mode $(\alpha - 1)/(\alpha + \beta - 2)$.

Numerical case. Coin: prior $\mathrm{Beta}(5, 5)$ (mode at $1/2$; worth ten pseudo-flips, five pseudo-heads and five pseudo-tails). Observed: 7 heads in 10 flips. MAP: $(5 + 7 - 1)/(5 + 5 + 10 - 2) = 11/18 \approx 0.611$. MLE: $0.700$. Posterior mean: $(5 + 7)/(5 + 5 + 10) = 12/20 = 0.600$. The prior pulls the estimate from the MLE $0.7$ toward the prior mode $0.5$; the MAP and the posterior mean differ slightly because the posterior $\mathrm{Beta}(12, 8)$ is mildly skewed.
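
The same numbers, checked with SciPy (a sketch; `scipy.stats.beta` is the only dependency beyond NumPy):

```python
import numpy as np
from scipy.stats import beta

a, b, n, k = 5, 5, 10, 7
post = beta(a + k, b + n - k)                       # posterior Beta(12, 8)

map_est = (a + k - 1) / (a + b + n - 2)             # closed-form posterior mode
print(f"MLE            = {k / n:.3f}")              # 0.700
print(f"MAP            = {map_est:.3f}")            # 0.611
print(f"posterior mean = {post.mean():.3f}")        # 0.600

# Sanity check: the closed-form mode matches a grid argmax of the pdf.
grid = np.linspace(0.001, 0.999, 9999)
print(np.isclose(grid[np.argmax(post.pdf(grid))], map_est, atol=1e-3))  # True
```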

When MAP and MLE Diverge: Three Failure Modes

The MAP-vs-MLE difference is the log-prior term. Three places where this term changes the answer qualitatively, not just numerically.

1. Invariance under reparameterization

The MLE is invariant under one-to-one reparameterization: if $\eta = g(\theta)$ for a smooth bijection $g$, then $\hat\eta_{\mathrm{MLE}} = g(\hat\theta_{\mathrm{MLE}})$. The MAP is not: a flat prior in $\theta$ is not flat in $\eta$ (it picks up a Jacobian factor), so the mode of the posterior in $\eta$ does not equal $g(\hat\theta_{\mathrm{MAP}})$.

Concrete example. Bernoulli $\theta$ on $(0, 1)$. Reparameterize by the log-odds $\eta = \log\frac{\theta}{1-\theta}$. The flat prior $\pi(\theta) = 1$ on $(0, 1)$ transforms to $\pi(\eta) = e^\eta/(1 + e^\eta)^2$ on $\mathbb R$, which is not flat. So the same Bayesian model yields different MAP answers depending on the coordinates in which the mode is computed. This is the standard objection that "MAP depends on parameterization", and it is part of why MLE remains the default frequentist estimator. The Jeffreys prior $\pi(\theta) \propto \sqrt{I(\theta)}$ is the standard response: the Jeffreys construction transforms consistently across parameterizations, though the posterior mode itself is equivariant only if it is defined relative to that reference measure rather than to Lebesgue measure.
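
A numeric sketch of the non-equivariance (assuming NumPy; $n$, $k$, and the grids are illustrative). The model is held fixed, with a flat prior in $\theta$; only the coordinates change. The mode in $\theta$ is $k/n = 0.700$, while the mode in $\eta$ mapped back through the sigmoid is $(k+1)/(n+2) \approx 0.667$.

```python
import numpy as np

n, k = 10, 7
theta = np.linspace(1e-4, 1 - 1e-4, 200_000)
eta = np.linspace(-6, 6, 200_000)
sig = 1 / (1 + np.exp(-eta))

# Posterior density in theta (flat prior): proportional to the likelihood.
post_theta = theta**k * (1 - theta)**(n - k)
# Posterior density in eta: same model, but the change of variables
# multiplies in the Jacobian d(theta)/d(eta) = sigma(eta) * (1 - sigma(eta)).
post_eta = sig**k * (1 - sig)**(n - k) * sig * (1 - sig)

print(f"mode in theta:            {theta[np.argmax(post_theta)]:.3f}")  # 0.700
print(f"mode in eta, mapped back: {sig[np.argmax(post_eta)]:.3f}")      # 0.667
```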

2. Skewed and multimodal posteriors

When the posterior is asymmetric or multimodal, the mode is a poor summary of the distribution's center. For a $\mathrm{Beta}(2, 5)$ posterior, the mode is $1/5 = 0.2$ but the mean is $2/7 \approx 0.286$. For a bimodal posterior, the mode picks whichever peak has slightly higher density, hiding the other entirely.

The decision-theoretic version: MAP minimizes the Bayes risk under 0-1 loss (for continuous parameters, more precisely, it is the limit of Bayes estimators under a "wrong unless within $\varepsilon$" loss as $\varepsilon \to 0$). For practical loss functions, the posterior mean (squared-error loss) or median (absolute-error loss) is usually a better summary, as the sketch below shows.
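
A Monte Carlo sketch of the loss-to-summary correspondence on the skewed $\mathrm{Beta}(2, 5)$ posterior above (assuming NumPy and SciPy; the grid, sample size, and $\varepsilon$ are illustrative). Each candidate estimate is scored by its expected posterior loss; the minimizers land near the mean, median, and mode respectively.

```python
import numpy as np
from scipy.stats import beta

post = beta(2, 5)
samples = post.rvs(size=100_000, random_state=0)
grid = np.arange(0.005, 1.0, 0.005)
eps = 0.01   # tolerance for the relaxed 0-1 loss

sq = [np.mean((samples - a) ** 2) for a in grid]         # squared-error loss
ab = [np.mean(np.abs(samples - a)) for a in grid]        # absolute-error loss
zo = [np.mean(np.abs(samples - a) > eps) for a in grid]  # 0-1 loss within eps

print(f"squared-error minimizer  {grid[np.argmin(sq)]:.3f} vs mean   {post.mean():.3f}")
print(f"absolute-error minimizer {grid[np.argmin(ab)]:.3f} vs median {post.median():.3f}")
print(f"0-1(eps) minimizer       {grid[np.argmin(zo)]:.3f} vs mode   0.200")
```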

3. Asymptotic agreement, finite-sample disagreement

As $n \to \infty$, the posterior concentrates around the truth and the prior becomes negligible. By Bernstein-von Mises, the posterior is asymptotically Gaussian and centered at the MLE, so MAP $\to$ MLE in probability. In finite samples, the disagreement is controlled by the prior strength relative to the data: with $n = 10$ and a prior worth 10 pseudo-samples, the prior carries 50% of the information; with $n = 10{,}000$, it carries about 0.1% and MAP $\approx$ MLE.
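
The washout is easy to watch in the Beta-Bernoulli model (a sketch assuming NumPy; the true $\theta$ and the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, theta_true = 5, 5, 0.7          # fixed Beta(5, 5) prior, fixed truth
for n in [10, 100, 1000, 10_000]:
    k = rng.binomial(n, theta_true)
    mle = k / n
    map_ = (a + k - 1) / (a + b + n - 2)
    print(f"n={n:>6}: MLE={mle:.4f}  MAP={map_:.4f}  gap={abs(mle - map_):.4f}")
```

The gap shrinks at the rate the prior pseudo-counts are diluted by the data, roughly $O(1/n)$.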

Common Confusions

Watch Out

The MAP is not the Bayesian posterior; it is a point summary of it

A common slip: "MAP gives the Bayesian answer." MAP gives a point estimate derived from the Bayesian posterior. The full Bayesian answer is the entire posterior distribution, which carries uncertainty information that the MAP discards. Treating MAP as the Bayesian answer is like treating the OLS coefficient as the frequentist answer without ever computing standard errors.

Watch Out

L2 regularization is not the same as ridge regression in every model

The L2-equals-Gaussian-prior identity needs only a parameter $\theta$ with prior $\mathcal N(0, \tau^2 I)$; the likelihood can be anything. In logistic regression the likelihood is Bernoulli, not Gaussian, yet L2-penalized logistic regression is still MAP under a Gaussian prior on the coefficients; the algebra is identical. The regularizer comes from the prior on the coefficients, not from the likelihood.
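
A numeric check of this point (a sketch assuming NumPy, SciPy, and scikit-learn; the data and $\tau^2$ are illustrative). scikit-learn's L2-penalized `LogisticRegression` minimizes $\tfrac12\|w\|^2 + C\sum_i \text{log-loss}_i$, which is the MAP objective multiplied through by $\tau^2$ when $C = \tau^2$:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, d, tau2 = 200, 3, 0.5                     # tau2: Gaussian prior variance
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -1.0, 0.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def neg_log_post(w):
    z = X @ w
    nll = np.sum(np.logaddexp(0, z) - y * z)   # Bernoulli NLL, numerically stable
    return nll + np.sum(w**2) / (2 * tau2)     # plus the Gaussian prior term

w_map = minimize(neg_log_post, np.zeros(d)).x

# sklearn minimizes (1/2)||w||^2 + C * sum of log-losses, so C = tau2 matches.
skl = LogisticRegression(C=tau2, penalty="l2", fit_intercept=False).fit(X, y)
print(np.allclose(w_map, skl.coef_.ravel(), atol=1e-3))  # True
```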

Watch Out

Lasso is not a frequentist method that happens to coincide with a Laplace MAP

This is the standard textbook narrative: "L1 regularization is a frequentist sparsity-inducing penalty, and separately its MAP interpretation involves a Laplace prior." But the Bayesian view is the first-principles derivation: choosing a Laplace prior gives lasso. The L1 penalty is a consequence of the prior choice, not an independent construct. The narrative usually goes the other way for historical reasons (Tibshirani's 1996 paper introduced lasso as a constrained-optimization method, not as MAP), but the math works either direction.

Watch Out

MAP does not give credible intervals

Want a 95% credible interval for $\theta$? You need the posterior CDF, not just the mode. MAP is a single number. A common move is to compute the MAP and the Hessian $H$ of the negative log-posterior at the MAP, and use the Laplace approximation $\mathcal N(\hat\theta_{\mathrm{MAP}}, H^{-1})$ to read off approximate credible intervals; this is fine when the posterior is approximately Gaussian, but it can be very wrong when the posterior is skewed or multimodal. Use full posterior sampling or variational inference for intervals.
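
Here is the failure in numbers, on the skewed $\mathrm{Beta}(2, 5)$ posterior from earlier (a sketch assuming NumPy and SciPy). The Laplace-approximate 95% interval extends below zero, which is impossible for a probability; the exact interval does not.

```python
import numpy as np
from scipy.stats import beta, norm

a, b = 2, 5                            # the skewed Beta(2, 5) posterior
mode = (a - 1) / (a + b - 2)           # MAP = 0.2

# Second derivative of the log posterior at the mode (negative at a max).
hess = -(a - 1) / mode**2 - (b - 1) / (1 - mode)**2
sd = np.sqrt(-1 / hess)                # Laplace-approximation std deviation

lo, hi = norm.interval(0.95, loc=mode, scale=sd)
print(f"Laplace approx 95% interval: ({lo:.3f}, {hi:.3f})")    # lower end < 0!
print(f"exact central 95% interval:  ({beta.ppf(0.025, a, b):.3f}, "
      f"{beta.ppf(0.975, a, b):.3f})")
```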

Summary

  • $\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \bigl[\log p(D \mid \theta) + \log \pi(\theta)\bigr]$: the posterior mode.
  • A flat prior recovers the MLE exactly.
  • A Gaussian prior $\mathcal N(0, \tau^2 I)$ gives L2 regularization with $\lambda = \sigma^2/\tau^2$ (ridge).
  • A Laplace prior $\mathrm{Lap}(0, b)$ gives L1 regularization with $\lambda = \sigma^2/b$ (lasso).
  • MAP and MLE coincide asymptotically (Bernstein-von Mises) but diverge in finite samples, with the gap controlled by prior strength vs. $n$.
  • MAP is not invariant to reparameterization; MLE is. The Jeffreys prior is the standard mitigation.
  • MAP is the mode, not the mean, of the posterior. For skewed posteriors the posterior mean is usually a better point summary.
  • MAP gives a point estimate, not uncertainty. Use the full posterior for intervals.

Exercises

ExerciseCore

Problem

Let $x_1, \dots, x_n \sim \mathcal N(\mu, 1)$ with known variance. Place the prior $\mu \sim \mathcal N(0, \tau^2)$. Derive the MAP estimate of $\mu$ and compare it to the MLE and the posterior mean.

ExerciseCore

Problem

Derive the MAP estimator for logistic regression with a Gaussian prior on the coefficient vector. Specifically: $y_i \mid x_i, w \sim \mathrm{Bernoulli}(\sigma(w^\top x_i))$ with $\sigma$ the sigmoid, and $w \sim \mathcal N(0, \tau^2 I)$. Write the objective function and identify the regularization strength.

ExerciseAdvanced

Problem

For the Beta-Bernoulli model, derive the MAP estimate $\hat\theta_{\mathrm{MAP}}$ and show that it can be written as a weighted average of the prior mode and the MLE. Use this to give an "effective sample size" interpretation of the prior.

ExerciseResearch

Problem

The Jeffreys prior is $\pi_J(\theta) \propto \sqrt{|I(\theta)|}$, where $I$ is the Fisher information. Show that for the Bernoulli model, $\pi_J(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}$ (a $\mathrm{Beta}(1/2, 1/2)$ density). Determine whether the resulting MAP is invariant under the log-odds reparameterization $\eta = \log(\theta/(1 - \theta))$.

References

Canonical:

  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. §1.2 (Bayesian probability theory, MAP), §3.1 (linear regression and the Gaussian/Laplace prior correspondence).
  • Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. §2.5 (modes and approximations), §4.1 (Laplace approximation).
  • Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer. Ch. 4.
  • Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation. Springer. Ch. 4.
  • Robert, C.P. (2007). The Bayesian Choice. Springer. Ch. 4 (loss functions and Bayes estimators).

Current:

  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. §4.5 (MAP estimation), §11.5 (Bayesian linear regression).
  • Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." J. Royal Statistical Society B, 58(1):267–288. (The original lasso paper; the Bayesian interpretation is in §4.)
  • Park, T., & Casella, G. (2008). "The Bayesian Lasso." Journal of the American Statistical Association, 103(482):681–686. (Full Bayesian treatment of the Laplace prior; recovers lasso as MAP and provides credible intervals.)
