

MLE vs. Method of Moments

Two classical estimation strategies: MLE maximizes the likelihood and is asymptotically efficient, while Method of Moments matches sample moments to population moments and is simpler but typically less efficient.

What Each Method Does

Both MLE and Method of Moments (MoM) estimate unknown parameters $\theta = (\theta_1, \ldots, \theta_k)$ of a statistical model from observed data $X_1, \ldots, X_n$. They differ in which equation they solve to find $\hat{\theta}$.

MLE finds the parameter value that makes the observed data most probable under the model.

MoM finds the parameter value that makes the population moments equal to the sample moments.

Side-by-Side Formulation

Definition

Maximum Likelihood Estimator

Given a parametric model $p(x|\theta)$, the MLE solves:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log p(X_i|\theta)$$

Equivalently, set the score function to zero:

$$\sum_{i=1}^n \nabla_\theta \log p(X_i|\theta) = 0$$

This is a system of $k$ equations in $k$ unknowns. The solution may require iterative optimization (Newton's method, EM, gradient ascent).
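As a sketch of that iterative route, the snippet below fits a Weibull model, whose score equations have no closed-form solution, by numerically minimizing the negative log-likelihood with scipy. The particular distribution, seed, and sample size here are illustrative choices, not part of the discussion above:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = 3.0 * rng.weibull(2.0, size=500)  # Weibull sample: shape k=2, scale lam=3

def neg_log_lik(params):
    """Negative log-likelihood of the Weibull model at (shape, scale)."""
    k, lam = params
    if k <= 0 or lam <= 0:
        return np.inf  # keep the optimizer inside the parameter space
    return -np.sum(stats.weibull_min.logpdf(data, c=k, scale=lam))

# No closed form for the score equations, so maximize the likelihood numerically
res = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
k_hat, lam_hat = res.x
```

With 500 observations the estimates land near the true (2, 3); any general-purpose optimizer works here since the log-likelihood is smooth in both parameters.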

Definition

Method of Moments Estimator

Express the first $k$ population moments as functions of $\theta$:

$$\mu_j(\theta) = \mathbb{E}_\theta[X^j], \quad j = 1, \ldots, k$$

The MoM estimator solves the system:

$$\mu_j(\hat{\theta}_{\text{MoM}}) = \frac{1}{n}\sum_{i=1}^n X_i^j, \quad j = 1, \ldots, k$$

This equates each population moment to its sample counterpart. With $k$ parameters and $k$ moment equations, the system is exactly identified: $k$ equations in $k$ unknowns.
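As a concrete instance of this system, here is a minimal sketch that estimates the endpoints of a Uniform(a, b) distribution by matching its first two moments; the model choice, seed, and sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(2.0, 5.0, size=2000)  # true endpoints a=2, b=5

m1 = x.mean()                 # first sample moment
s2 = np.mean((x - m1) ** 2)   # second central sample moment

# For Uniform(a, b): E[X] = (a+b)/2 and Var(X) = (b-a)^2 / 12.
# Solving the two moment equations for (a, b) in closed form:
half_width = np.sqrt(3.0 * s2)
a_hat, b_hat = m1 - half_width, m1 + half_width
```

The two moment equations invert algebraically, so no optimizer is needed: the estimator is pencil-and-paper arithmetic on the sample mean and variance.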

Where Each Is Stronger

MLE wins on efficiency

The crown jewel of MLE theory is asymptotic efficiency. Under regularity conditions, the MLE achieves the Cramér-Rao lower bound as $n \to \infty$:

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$$

where $I(\theta_0)$ is the Fisher information matrix. No consistent estimator can have smaller asymptotic variance. The MoM estimator is consistent and asymptotically normal, but its asymptotic variance is generally larger than $I(\theta_0)^{-1}$.

MoM wins on simplicity and closed-form solutions

MoM often yields explicit closed-form estimators. For a Gamma distribution with shape $\alpha$ and rate $\beta$, the first two moments are $\mathbb{E}[X] = \alpha/\beta$ and $\mathrm{Var}(X) = \alpha/\beta^2$, so matching them to the sample mean $\bar{X}$ and sample variance $S^2$ gives

$$\hat{\alpha}_{\text{MoM}} = \frac{\bar{X}^2}{S^2}, \qquad \hat{\beta}_{\text{MoM}} = \frac{\bar{X}}{S^2}$$

The MLE for the Gamma distribution has no closed form and requires iterative numerical optimization (typically Newton-Raphson on an equation involving the digamma function).
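The contrast can be seen directly in code. This sketch computes the Gamma MoM estimates in closed form and lets scipy's iterative fitter produce the MLE on the same simulated data (scipy parameterizes the Gamma by shape and scale $= 1/\beta$; the seed and sample size are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=5000)  # alpha=3, rate beta=2

# MoM: explicit algebra on the first two sample moments
xbar, s2 = x.mean(), x.var()
alpha_mom, beta_mom = xbar**2 / s2, xbar / s2

# MLE: iterative numerical fit (loc fixed at 0 for the two-parameter Gamma)
alpha_mle, _, scale_mle = stats.gamma.fit(x, floc=0)
beta_mle = 1.0 / scale_mle  # convert scipy's scale back to a rate
```

Both land near (3, 2) at this sample size; the difference is that the MoM line is one division, while `gamma.fit` runs an optimizer under the hood.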

The Efficiency Gap

Theorem

Asymptotic Efficiency of MLE

Statement

Under regularity conditions, as $n \to \infty$:

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$$

The MLE is asymptotically efficient: its asymptotic variance attains the Cramér-Rao lower bound $I(\theta_0)^{-1}$, the smallest possible for any consistent estimator.

Intuition

The MLE extracts all information that the likelihood function contains about $\theta$. The Fisher information $I(\theta)$ measures how much information each observation carries, and the MLE uses all of it. MoM, by contrast, uses only the information contained in the first $k$ moments, which is generally a strict subset of the information in the full likelihood.

How large is the gap? It depends on the model:

| Distribution | MLE asymptotic variance | MoM asymptotic variance | Relative efficiency |
|---|---|---|---|
| Normal($\mu$, $\sigma^2$) | $\sigma^2/n$, $2\sigma^4/n$ | Same | 100% |
| Exponential($\lambda$) | $\lambda^2/n$ | Same | 100% |
| Gamma($\alpha$, $\beta$) | $I(\theta)^{-1}/n$ | Larger | $< 100\%$ |
| Beta($\alpha$, $\beta$) | $I(\theta)^{-1}/n$ | Larger | Can be $< 50\%$ |

For the normal and exponential families, MoM and MLE coincide. For distributions where moments do not capture the full shape (e.g., heavy tails, skewness), the efficiency gap can be substantial.

Key Assumptions That Differ

| | MLE | MoM |
|---|---|---|
| What it solves | Score equations $\nabla \ell = 0$ | Moment equations $\mu_j = m_j$ |
| Requires | Likelihood function in closed form | Moments as functions of $\theta$ |
| Closed form | Rarely | Often |
| Computational cost | Iterative optimization | Usually algebraic |
| Asymptotic efficiency | Achieves Cramér-Rao bound | Generally sub-optimal |
| Invariance | Invariant to reparameterization | Not invariant |
| Existence/uniqueness | May have multiple local maxima | Solution may not exist or be unique |

When MoM Is Actually Preferred

Example

Estimating mixture models as a warm start

For Gaussian mixture models, the likelihood surface has many local optima. MoM (via spectral methods on the moment tensor) provides a consistent starting point that is close to the true parameters. The MLE, initialized from MoM, then refines to an efficient estimate. MoM handles the global search; MLE handles the local refinement.

Example

Models where the likelihood is intractable

For some models (e.g., certain latent variable models, alpha-stable distributions), the likelihood function has no closed form. The density may involve intractable integrals. MoM requires only moments, which are often available in closed form even when the density is not.

Example

Robustness to model misspecification

If the model is wrong (as it always is in practice), the MLE converges to the parameter that minimizes KL divergence to the true data distribution. MoM converges to the parameter that matches the first $k$ moments. When the model is misspecified, moment matching can be more robust because it does not try to fit the entire distribution shape, only its low-order summary statistics.

Example

Quick estimation in the field

When you need a fast answer and computational resources are limited, MoM gives you an estimate with pencil and paper. The sample mean and variance are trivially computed, and for many models, the moment equations have explicit solutions. MLE may require writing optimization code.

Generalized Method of Moments (GMM)

When more moment conditions are available than parameters ($m > k$), the system is over-identified. GMM minimizes a weighted quadratic form:

$$\hat{\theta}_{\text{GMM}} = \arg\min_\theta \left(\frac{1}{n}\sum_i g(X_i, \theta)\right)^T W \left(\frac{1}{n}\sum_i g(X_i, \theta)\right)$$

where $g(X_i, \theta)$ is a vector of moment conditions and $W$ is a weighting matrix. With the optimal weight matrix $W = \text{Var}(g)^{-1}$, GMM achieves the semiparametric efficiency bound among moment-based estimators. GMM is the workhorse of econometrics.
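A minimal sketch of this objective, assuming an Exponential($\lambda$) model with two moment conditions ($\mathbb{E}[X] = 1/\lambda$ and $\mathbb{E}[X^2] = 2/\lambda^2$) for one parameter, and using the identity weight matrix rather than the optimal $W$:

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0 / 1.5, size=3000)  # true rate lam = 1.5

def gbar(lam):
    """Sample average of the two moment conditions g(X, lam)."""
    return np.array([np.mean(x) - 1.0 / lam,
                     np.mean(x**2) - 2.0 / lam**2])

def objective(params):
    g = gbar(params[0])
    return g @ np.eye(2) @ g  # quadratic form with W = I (not the optimal W)

res = optimize.minimize(objective, x0=[1.0], bounds=[(1e-6, None)])
lam_gmm = res.x[0]
```

With one parameter and two conditions the system is over-identified, so neither condition is matched exactly; the optimizer balances them according to $W$. A two-step GMM would re-estimate with $W = \widehat{\text{Var}}(g)^{-1}$ from this first pass.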

Common Confusions

Watch Out

MoM can give impossible estimates

MoM estimates are not guaranteed to lie in the parameter space. For example, matching moments for a variance parameter can yield a negative estimate if the sample moments happen to fall in an incompatible region. MLE, by construction, always returns a parameter in the feasible set (assuming it is found via constrained optimization).

Watch Out

MLE efficiency requires a correctly specified model

The Cramér-Rao bound and asymptotic efficiency of MLE assume the model is correct. Under misspecification, the MLE converges to the pseudo-true parameter (KL projection), but it is no longer efficient in any meaningful sense. MoM may be more robust because it targets specific distributional features rather than the entire distribution.

Watch Out

Efficiency is an asymptotic property

In finite samples, the MLE can have higher bias or mean squared error than MoM, especially in small samples or near parameter boundaries. Asymptotic efficiency is a statement about the rate as nn \to \infty. For small nn, compare estimators via simulation, not asymptotic theory.
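Such a simulation comparison can be sketched as follows, here for the shape parameter of a Gamma model at $n = 20$; the model, seed, and replication count are illustrative assumptions, and which estimator wins on MSE depends on the setting:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha_true, n, reps = 2.0, 20, 300

mom_err, mle_err = [], []
for _ in range(reps):
    x = rng.gamma(shape=alpha_true, scale=1.0, size=n)
    # MoM estimate of the shape from the first two sample moments
    xbar, s2 = x.mean(), x.var()
    mom_err.append((xbar**2 / s2 - alpha_true) ** 2)
    # MLE via scipy's iterative fit (loc fixed at 0)
    a_hat, _, _ = stats.gamma.fit(x, floc=0)
    mle_err.append((a_hat - alpha_true) ** 2)

mse_mom, mse_mle = np.mean(mom_err), np.mean(mle_err)
```

Inspecting `mse_mom` and `mse_mle` (and the biases) across sample sizes shows how the asymptotic ranking emerges as $n$ grows; at $n = 20$ both estimators carry noticeable upward bias for the shape.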