
Method of Moments

Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.


Why This Matters

The method of moments (MoM) is the oldest systematic approach to parameter estimation. Before MLE, before Bayesian methods, there was moment matching. You still encounter it constantly:

  • Initializing EM: Gaussian mixture models often use MoM to get a reasonable starting point before running EM
  • Latent variable models: when the likelihood is intractable, MoM (or its generalization, GMM) can bypass the likelihood entirely
  • Econometrics: GMM is the workhorse estimator when you have moment conditions but no full likelihood specification

MoM is also the simplest estimator to derive and understand. It builds intuition for what estimation is doing before you tackle the more sophisticated MLE theory.

Mental Model

You know that the population mean is $\mathbb{E}[X] = g_1(\theta)$, the population variance is $\text{Var}(X) = g_2(\theta)$, and so on. You compute the sample mean, sample variance, etc. Then you solve the equations that set sample moments equal to population moments. The solutions are your parameter estimates.

It is algebra, not optimization.

Formal Setup and Notation

Let $X_1, \ldots, X_n$ be i.i.d. from a distribution $P_\theta$ with parameter $\theta \in \mathbb{R}^d$. Define:

Definition

Population Moments

The $k$-th population moment is:

$$\mu_k(\theta) = \mathbb{E}_\theta[X^k]$$

More generally, a population moment can be any known function of $\theta$: $\mathbb{E}_\theta[g(X)] = m(\theta)$ for some function $g$.

Definition

Sample Moments

The $k$-th sample moment is:

$$\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^n X_i^k$$

By the law of large numbers, $\hat{\mu}_k \xrightarrow{P} \mu_k(\theta)$ as $n \to \infty$.
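The convergence of sample moments is easy to see numerically. A minimal sketch (using an exponential distribution as the illustrating example, which is an assumption on my part, not from the text above):

```python
import numpy as np

# Exponential with rate 2: E[X] = 1/2 and E[X^2] = 2/rate^2 = 0.5
rng = np.random.default_rng(0)
rate = 2.0
pop_m1, pop_m2 = 1 / rate, 2 / rate**2

for n in (100, 10_000, 1_000_000):
    x = rng.exponential(scale=1 / rate, size=n)
    m1_hat = np.mean(x)       # first sample moment
    m2_hat = np.mean(x**2)    # second sample moment
    print(n, abs(m1_hat - pop_m1), abs(m2_hat - pop_m2))
```

The absolute errors shrink as $n$ grows, as the law of large numbers predicts.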

Definition

Method of Moments Estimator

If $\theta \in \mathbb{R}^d$, the method of moments estimator $\hat{\theta}_{\text{MoM}}$ solves the system of $d$ equations:

$$\hat{\mu}_k = \mu_k(\hat{\theta}_{\text{MoM}}) \quad \text{for } k = 1, \ldots, d$$

That is, set the first $d$ sample moments equal to the corresponding population moments and solve for $\theta$.
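For a one-parameter model this is a single equation in a single unknown. A sketch, using an exponential distribution with rate $\lambda$ as an assumed example: since $\mathbb{E}[X] = 1/\lambda$, matching the first moment gives $\hat{\lambda}_{\text{MoM}} = 1/\bar{X}$.

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 3.0
x = rng.exponential(scale=1 / true_rate, size=50_000)

# Moment equation: sample mean = 1 / lambda  =>  lambda_hat = 1 / sample mean
lambda_hat = 1 / np.mean(x)
print(lambda_hat)  # close to the true rate of 3.0
```

No optimization is involved: the moment equation is inverted in closed form.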

Main Theorems

Theorem

Consistency of Method of Moments

Statement

If the function $m: \theta \mapsto (\mu_1(\theta), \ldots, \mu_d(\theta))$ is continuous and has a continuous inverse $m^{-1}$ in a neighborhood of $\theta^*$, then the method of moments estimator is consistent:

$$\hat{\theta}_{\text{MoM}} \xrightarrow{P} \theta^*$$

Intuition

The sample moments converge to the population moments by the law of large numbers. If the map from parameters to moments is invertible and continuous, then inverting converging inputs gives converging outputs. This is just the continuous mapping theorem applied to moment matching.

Proof Sketch

By the law of large numbers, $\hat{\mu}_k \xrightarrow{P} \mu_k(\theta^*)$ for each $k = 1, \ldots, d$. The estimator is $\hat{\theta} = m^{-1}(\hat{\mu}_1, \ldots, \hat{\mu}_d)$. Since $m^{-1}$ is continuous, the continuous mapping theorem gives $\hat{\theta} \xrightarrow{P} m^{-1}(\mu_1(\theta^*), \ldots, \mu_d(\theta^*)) = \theta^*$.

Why It Matters

MoM consistency is free: it only requires the law of large numbers and a smooth relationship between parameters and moments. You do not need to verify regularity conditions on the likelihood. This makes MoM applicable to models where MLE is difficult or undefined.

Failure Mode

If the mapping $m$ is not invertible (different parameters give the same moments), the MoM estimator is not identified. This happens in some mixture models where higher-order moments are needed for identification. Also, MoM can produce estimates outside the parameter space (e.g., a negative variance estimate).

Asymptotic Distribution

The MoM estimator is asymptotically normal. By the CLT, the vector of sample moments is asymptotically Gaussian. The delta method then gives:

$$\sqrt{n}(\hat{\theta}_{\text{MoM}} - \theta^*) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\text{MoM}})$$

where $\Sigma_{\text{MoM}} = (Dm^{-1}) \, \text{Cov}(g(X)) \, (Dm^{-1})^\top$ and $Dm^{-1}$ is the Jacobian of the inverse moment map.

In general, $\Sigma_{\text{MoM}} \neq I(\theta)^{-1}$. The MoM estimator is typically less efficient than MLE: it has larger asymptotic variance. Under standard regularity conditions the MLE asymptotically achieves the Cramér-Rao bound; MoM usually does not.

Generalized Method of Moments (GMM)

Definition

Generalized Method of Moments

When you have more moment conditions than parameters ($p > d$), the system is over-determined. GMM finds the parameter that minimizes a weighted norm of the moment violations:

$$\hat{\theta}_{\text{GMM}} = \arg\min_\theta \left(\frac{1}{n}\sum_{i=1}^n g(X_i, \theta)\right)^\top W \left(\frac{1}{n}\sum_{i=1}^n g(X_i, \theta)\right)$$

where $g(X, \theta) \in \mathbb{R}^p$ is a vector of moment conditions satisfying $\mathbb{E}[g(X, \theta^*)] = 0$, and $W$ is a positive definite weight matrix.

The optimal weight matrix is $W = \text{Cov}(g(X, \theta^*))^{-1}$, which minimizes the asymptotic variance of $\hat{\theta}_{\text{GMM}}$. In practice, it is estimated in a two-step procedure: first estimate $\theta$ with $W = I$, then re-estimate with the estimated optimal $W$.
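The two-step procedure can be sketched in a few lines. The example below is my own illustration (not from the text): a Poisson rate $\lambda$ estimated from two moment conditions, $\mathbb{E}[X - \lambda] = 0$ and $\mathbb{E}[X^2 - \lambda - \lambda^2] = 0$, so $p = 2 > d = 1$ and the system is over-identified.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=20_000)

def gbar(lam):
    """Sample average of the two moment conditions at lam."""
    return np.array([np.mean(x) - lam, np.mean(x**2) - lam - lam**2])

def objective(lam, W):
    g = gbar(lam)
    return g @ W @ g

# Step 1: identity weight matrix
step1 = minimize_scalar(objective, args=(np.eye(2),), bounds=(0.1, 20), method="bounded")
lam1 = step1.x

# Step 2: re-weight by the inverse covariance of the moment conditions,
# estimated at the step-1 parameter value
G = np.column_stack([x - lam1, x**2 - lam1 - lam1**2])
W_opt = np.linalg.inv(np.cov(G.T))
step2 = minimize_scalar(objective, args=(W_opt,), bounds=(0.1, 20), method="bounded")
print(step2.x)  # close to the true rate of 4.0
```

The bounds on $\lambda$ here are an arbitrary search interval for the scalar optimizer, not part of the GMM theory.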

When MoM Is Preferred Over MLE

MoM is not just a historical curiosity. There are real situations where it is the better choice:

  1. Closed-form solutions: MoM often gives explicit formulas where MLE requires numerical optimization (e.g., fitting a Gamma distribution)
  2. Computational simplicity: for initial parameter estimates or large-scale problems where likelihood evaluation is expensive
  3. Robustness to misspecification: MoM only requires certain moment conditions to hold, not the full distributional model. If the model is wrong but the moments are right, MoM remains consistent
  4. Intractable likelihoods: in latent variable models, the likelihood may involve intractable integrals, but moments may be computable
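Point 1 can be made concrete with the Gamma example it mentions. A sketch under the standard shape/rate parametrization (the moment-matching formulas below follow directly from the mean and variance expressions, but the code is my own illustration):

```python
import numpy as np

# Gamma(shape=alpha, rate=beta): E[X] = alpha/beta, Var(X) = alpha/beta^2.
# Matching sample mean m and sample variance v gives closed forms:
#   alpha_hat = m^2 / v,  beta_hat = m / v
rng = np.random.default_rng(7)
alpha_true, beta_true = 2.5, 1.5
x = rng.gamma(shape=alpha_true, scale=1 / beta_true, size=100_000)

m, v = np.mean(x), np.var(x)   # np.var divides by n, matching the MoM convention
alpha_hat = m**2 / v
beta_hat = m / v
print(alpha_hat, beta_hat)
```

The Gamma MLE, by contrast, has no closed form in the shape parameter and requires numerical root-finding.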

Canonical Examples

Example

MoM for Gaussian parameters

For $X_i \sim \mathcal{N}(\mu, \sigma^2)$, the first two population moments are $\mu_1 = \mu$ and $\mu_2 = \mu^2 + \sigma^2$.

Setting sample moments equal: $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2$.

The MoM estimator of the mean is the sample mean (same as MLE). The MoM estimator of variance divides by $n$ (same as MLE, but biased).
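A quick numerical check of the identity above (a sketch with arbitrary assumed parameters): the second-moment equation yields exactly the $1/n$ sample variance.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_hat = np.mean(x)
sigma2_hat = np.mean(x**2) - mu_hat**2   # solve the second moment equation

# Identical to the 1/n variance, which np.var computes by default (ddof=0)
print(np.allclose(sigma2_hat, np.var(x)))
```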

Example

MoM for Uniform(0, theta)

For $X_i \sim \text{Uniform}(0, \theta)$, $\mathbb{E}[X] = \theta/2$. Setting $\bar{X} = \hat{\theta}/2$ gives $\hat{\theta}_{\text{MoM}} = 2\bar{X}$.

Compare with MLE: $\hat{\theta}_{\text{MLE}} = X_{(n)} = \max_i X_i$. The MLE is more efficient: its error converges at rate $O(1/n)$ vs $O(1/\sqrt{n})$ for MoM. This is an example where MoM is much worse than MLE.
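The rate gap is easy to see by Monte Carlo. A sketch (sample size and repetition count chosen arbitrarily for illustration):

```python
import numpy as np

# Compare MoM (2 * sample mean) and MLE (sample max) for Uniform(0, theta)
rng = np.random.default_rng(0)
theta, n, reps = 1.0, 500, 2_000

x = rng.uniform(0, theta, size=(reps, n))
mom_err = np.abs(2 * x.mean(axis=1) - theta)   # |2*mean - theta| per replication
mle_err = np.abs(x.max(axis=1) - theta)        # |max - theta| per replication

print(mom_err.mean(), mle_err.mean())
```

At this sample size the average MLE error is roughly an order of magnitude smaller, reflecting the $O(1/n)$ vs $O(1/\sqrt{n})$ rates.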

Common Confusions

Watch Out

MoM is not always less efficient than MLE

For exponential families, the MLE and MoM often coincide (both use the sufficient statistic). The efficiency gap only appears for models where sufficient statistics are higher-dimensional than the parameter or where the moment map is far from optimal. For the Gaussian mean, MoM = MLE.

Watch Out

MoM can give impossible estimates

Nothing prevents MoM from returning a negative variance estimate or a probability outside $[0,1]$. Unlike MLE (which respects the model structure), MoM inverts algebraic equations without constraints. In practice, you clip or project the estimate onto the valid parameter space.
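One hypothetical setting where this bites (my own illustration, not from the text): observations $X = S + \varepsilon$ with known noise variance, where the MoM estimate of $\text{Var}(S)$ is the sample variance minus the noise variance. In small samples that difference can come out negative, and the usual fix is to project onto $[0, \infty)$.

```python
import numpy as np

rng = np.random.default_rng(5)
noise_var = 1.0      # assumed known
signal_var = 0.05    # small true signal variance
n = 30               # small sample, so the raw estimate is noisy

s = rng.normal(0, np.sqrt(signal_var), size=n)
x = s + rng.normal(0, np.sqrt(noise_var), size=n)

raw = np.var(x) - noise_var    # MoM estimate of Var(S); can be negative
clipped = max(raw, 0.0)        # project onto the valid parameter space
print(raw, clipped)
```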

Summary

  • MoM: set sample moments $\hat{\mu}_k$ equal to population moments $\mu_k(\theta)$ and solve for $\theta$
  • Consistency follows from the law of large numbers plus the continuous mapping theorem
  • MoM is asymptotically normal but typically less efficient than MLE
  • GMM handles over-identified models (more moment conditions than parameters) by minimizing a weighted quadratic form
  • MoM is preferred when: closed-form solutions exist, likelihoods are intractable, or robustness to misspecification is needed
  • MoM can produce out-of-range estimates; MLE respects model constraints

Exercises

ExerciseCore

Problem

Compute the method of moments estimator for the Poisson distribution: $X_i \sim \text{Poisson}(\lambda)$. Compare it with the MLE.

ExerciseAdvanced

Problem

For the Gamma distribution with shape $\alpha$ and rate $\beta$ (so $\mathbb{E}[X] = \alpha/\beta$ and $\text{Var}(X) = \alpha/\beta^2$), derive the MoM estimators for $\alpha$ and $\beta$. Explain why MoM yields closed-form estimators here while the MLE for $\alpha$ requires numerical optimization.

References

Canonical:

  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 7
  • van der Vaart, Asymptotic Statistics (1998), Chapter 5
  • Hansen, "Large Sample Properties of Generalized Method of Moments Estimators" (1982)

Current:

  • Wasserman, All of Statistics (2004), Chapter 9
  • Hall, Generalized Method of Moments (2005)
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6


Last reviewed: April 2026
