

MLE vs. Method of Moments

Two classical estimation strategies: MLE maximizes the likelihood and is asymptotically efficient, while Method of Moments matches sample moments to population moments and is simpler but typically less efficient.

What Each Method Does

Both MLE and Method of Moments (MoM) estimate unknown parameters $\theta = (\theta_1, \ldots, \theta_k)$ of a statistical model from observed data $X_1, \ldots, X_n$. They differ in which equation they solve to find $\hat{\theta}$.

MLE finds the parameter value that makes the observed data most probable under the model.

MoM finds the parameter value that makes the population moments equal to the sample moments.

Side-by-Side Formulation

Definition

Maximum Likelihood Estimator

Given a parametric model $p(x|\theta)$, the MLE solves:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log p(X_i|\theta)$$

Equivalently, set the score function to zero:

$$\sum_{i=1}^n \nabla_\theta \log p(X_i|\theta) = 0$$

This is a system of $k$ equations in $k$ unknowns. The solution may require iterative optimization (Newton's method, EM, gradient ascent).
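As a sketch of that iterative route, the snippet below fits a Weibull model, whose score equations have no closed-form solution, by numerically minimizing the negative log-likelihood with scipy. The particular distribution, seed, and sample size here are illustrative choices, not part of the discussion above:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = 3.0 * rng.weibull(2.0, size=500)  # Weibull sample: shape k=2, scale lam=3

def neg_log_lik(params):
    """Negative log-likelihood of the Weibull model at (shape, scale)."""
    k, lam = params
    if k <= 0 or lam <= 0:
        return np.inf  # keep the optimizer inside the parameter space
    return -np.sum(stats.weibull_min.logpdf(data, c=k, scale=lam))

# No closed form for the score equations, so maximize the likelihood numerically
res = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
k_hat, lam_hat = res.x
```

With 500 observations the estimates land near the true (2, 3); any general-purpose optimizer works here since the log-likelihood is smooth in both parameters.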

Definition

Method of Moments Estimator

Express the first $k$ population moments as functions of $\theta$:

$$\mu_j(\theta) = \mathbb{E}_\theta[X^j], \quad j = 1, \ldots, k$$

The MoM estimator solves the system:

$$\mu_j(\hat{\theta}_{\text{MoM}}) = \frac{1}{n}\sum_{i=1}^n X_i^j, \quad j = 1, \ldots, k$$

This equates each population moment to its sample counterpart. With $k$ parameters and $k$ moment equations, the system is exactly identified: $k$ equations in $k$ unknowns.
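As a concrete instance of this system, here is a minimal sketch that estimates the endpoints of a Uniform(a, b) distribution by matching its first two moments; the model choice, seed, and sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(2.0, 5.0, size=2000)  # true endpoints a=2, b=5

m1 = x.mean()                 # first sample moment
s2 = np.mean((x - m1) ** 2)   # second central sample moment

# For Uniform(a, b): E[X] = (a+b)/2 and Var(X) = (b-a)^2 / 12.
# Solving the two moment equations for (a, b) in closed form:
half_width = np.sqrt(3.0 * s2)
a_hat, b_hat = m1 - half_width, m1 + half_width
```

The two moment equations invert algebraically, so no optimizer is needed: the estimator is pencil-and-paper arithmetic on the sample mean and variance.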

Where Each Is Stronger

MLE wins on efficiency

The crown jewel of MLE theory is asymptotic efficiency. Under regularity conditions, the MLE achieves the Cramér-Rao lower bound as $n \to \infty$:

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$$

where $I(\theta_0)$ is the Fisher information matrix. No consistent estimator can have smaller asymptotic variance. The MoM estimator is consistent and asymptotically normal, but its asymptotic variance is generally larger than $I(\theta_0)^{-1}$.

MoM wins on simplicity and closed-form solutions

MoM often yields explicit closed-form estimators. For a Gamma distribution with shape $\alpha$ and rate $\beta$, the first two moments are $\mathbb{E}[X] = \alpha/\beta$ and $\mathrm{Var}(X) = \alpha/\beta^2$, so matching them to the sample mean $\bar{X}$ and sample variance $S^2$ gives

$$\hat{\alpha}_{\text{MoM}} = \frac{\bar{X}^2}{S^2}, \qquad \hat{\beta}_{\text{MoM}} = \frac{\bar{X}}{S^2}$$

The MLE for the Gamma distribution has no closed form and requires iterative numerical optimization (typically Newton-Raphson on an equation involving the digamma function).
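The contrast can be seen directly in code. This sketch computes the Gamma MoM estimates in closed form and lets scipy's iterative fitter produce the MLE on the same simulated data (scipy parameterizes the Gamma by shape and scale $= 1/\beta$; the seed and sample size are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=5000)  # alpha=3, rate beta=2

# MoM: explicit algebra on the first two sample moments
xbar, s2 = x.mean(), x.var()
alpha_mom, beta_mom = xbar**2 / s2, xbar / s2

# MLE: iterative numerical fit (loc fixed at 0 for the two-parameter Gamma)
alpha_mle, _, scale_mle = stats.gamma.fit(x, floc=0)
beta_mle = 1.0 / scale_mle  # convert scipy's scale back to a rate
```

Both land near (3, 2) at this sample size; the difference is that the MoM line is one division, while `gamma.fit` runs an optimizer under the hood.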

The Efficiency Gap

Theorem

Asymptotic Efficiency of MLE

Statement

Under regularity conditions, as $n \to \infty$:

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$$

The MLE is asymptotically efficient: its asymptotic variance attains the Cramér-Rao lower bound $I(\theta_0)^{-1}$, the smallest possible for any consistent estimator.

Intuition

The MLE extracts all information that the likelihood function contains about $\theta$. The Fisher information $I(\theta)$ measures how much information each observation carries, and the MLE uses all of it. MoM, by contrast, uses only the information contained in the first $k$ moments, which is generally a strict subset of the information in the full likelihood.

How large is the gap? It depends on the model:

| Distribution | MLE asymptotic variance | MoM asymptotic variance | Relative efficiency |
|---|---|---|---|
| Normal($\mu$, $\sigma^2$) | $\sigma^2/n$, $2\sigma^4/n$ | Same | 100% |
| Exponential($\lambda$) | $\lambda^2/n$ | Same | 100% |
| Gamma($\alpha$, $\beta$) | $I(\theta)^{-1}/n$ | Larger | $< 100\%$ |
| Beta($\alpha$, $\beta$) | $I(\theta)^{-1}/n$ | Larger | Can be $< 50\%$ |

For the normal and exponential families, MoM and MLE coincide. For distributions where moments do not capture the full shape (e.g., heavy tails, skewness), the efficiency gap can be substantial.

Key Assumptions That Differ

| | MLE | MoM |
|---|---|---|
| What it solves | Score equations $\nabla \ell = 0$ | Moment equations $\mu_j = m_j$ |
| Requires | Likelihood function in closed form | Moments as functions of $\theta$ |
| Closed form | Rarely | Often |
| Computational cost | Iterative optimization | Usually algebraic |
| Asymptotic efficiency | Achieves Cramér-Rao bound | Generally sub-optimal |
| Invariance | Invariant to reparameterization | Not invariant |
| Existence/uniqueness | May have multiple local maxima | Solution may not exist or be unique |

When MoM Is Actually Preferred

Example

Estimating mixture models as a warm start

For Gaussian mixture models, the likelihood surface has many local optima. MoM (via spectral methods on the moment tensor) provides a consistent starting point that is close to the true parameters. The MLE, initialized from MoM, then refines to an efficient estimate. MoM handles the global search; MLE handles the local refinement.

Example

Models where the likelihood is intractable

For some models (e.g., certain latent variable models, alpha-stable distributions), the likelihood function has no closed form. The density may involve intractable integrals. MoM requires only moments, which are often available in closed form even when the density is not.

Example

Robustness to model misspecification

If the model is wrong (as it always is in practice), the MLE converges to the parameter that minimizes KL divergence to the true data distribution. MoM converges to the parameter that matches the first $k$ moments. When the model is misspecified, moment matching can be more robust because it does not try to fit the entire distribution shape, only its low-order summary statistics.

Example

Quick estimation in the field

When you need a fast answer and computational resources are limited, MoM gives you an estimate with pencil and paper. The sample mean and variance are trivially computed, and for many models, the moment equations have explicit solutions. MLE may require writing optimization code.

Generalized Method of Moments (GMM)

When more moment conditions are available than parameters ($m > k$), the system is over-identified. GMM minimizes a weighted quadratic form:

$$\hat{\theta}_{\text{GMM}} = \arg\min_\theta \left(\frac{1}{n}\sum_i g(X_i, \theta)\right)^T W \left(\frac{1}{n}\sum_i g(X_i, \theta)\right)$$

where $g(X_i, \theta)$ is a vector of moment conditions and $W$ is a weighting matrix. With the optimal weight matrix $W = \text{Var}(g)^{-1}$, GMM achieves the semiparametric efficiency bound among moment-based estimators. GMM is the workhorse of econometrics.
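A minimal sketch of this objective, assuming an Exponential($\lambda$) model with two moment conditions ($\mathbb{E}[X] = 1/\lambda$ and $\mathbb{E}[X^2] = 2/\lambda^2$) for one parameter, and using the identity weight matrix rather than the optimal $W$:

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0 / 1.5, size=3000)  # true rate lam = 1.5

def gbar(lam):
    """Sample average of the two moment conditions g(X, lam)."""
    return np.array([np.mean(x) - 1.0 / lam,
                     np.mean(x**2) - 2.0 / lam**2])

def objective(params):
    g = gbar(params[0])
    return g @ np.eye(2) @ g  # quadratic form with W = I (not the optimal W)

res = optimize.minimize(objective, x0=[1.0], bounds=[(1e-6, None)])
lam_gmm = res.x[0]
```

With one parameter and two conditions the system is over-identified, so neither condition is matched exactly; the optimizer balances them according to $W$. A two-step GMM would re-estimate with $W = \widehat{\text{Var}}(g)^{-1}$ from this first pass.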

Common Confusions

Watch Out

MoM can give impossible estimates

MoM estimates are not guaranteed to lie in the parameter space. For example, matching moments for a variance parameter can yield a negative estimate if the sample moments happen to fall in an incompatible region. MLE, by construction, always returns a parameter in the feasible set (assuming it is found via constrained optimization).

Watch Out

MLE efficiency requires a correctly specified model

The Cramér-Rao bound and asymptotic efficiency of MLE assume the model is correct. Under misspecification, the MLE converges to the pseudo-true parameter (KL projection), but it is no longer efficient in any meaningful sense. MoM may be more robust because it targets specific distributional features rather than the entire distribution.

Watch Out

Efficiency is an asymptotic property

In finite samples, the MLE can have higher bias or mean squared error than MoM, especially in small samples or near parameter boundaries. Asymptotic efficiency is a statement about the rate as nn \to \infty. For small nn, compare estimators via simulation, not asymptotic theory.
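Such a simulation comparison can be sketched as follows, here for the shape parameter of a Gamma model at $n = 20$; the model, seed, and replication count are illustrative assumptions, and which estimator wins on MSE depends on the setting:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha_true, n, reps = 2.0, 20, 300

mom_err, mle_err = [], []
for _ in range(reps):
    x = rng.gamma(shape=alpha_true, scale=1.0, size=n)
    # MoM estimate of the shape from the first two sample moments
    xbar, s2 = x.mean(), x.var()
    mom_err.append((xbar**2 / s2 - alpha_true) ** 2)
    # MLE via scipy's iterative fit (loc fixed at 0)
    a_hat, _, _ = stats.gamma.fit(x, floc=0)
    mle_err.append((a_hat - alpha_true) ** 2)

mse_mom, mse_mle = np.mean(mom_err), np.mean(mle_err)
```

Inspecting `mse_mom` and `mse_mle` (and the biases) across sample sizes shows how the asymptotic ranking emerges as $n$ grows; at $n = 20$ both estimators carry noticeable upward bias for the shape.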