
Method of Moments

Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.


Why This Matters

The method of moments (MoM) is the oldest systematic approach to parameter estimation. Before MLE, before Bayesian methods, there was moment matching. You still encounter it constantly:

  • Initializing EM: Gaussian mixture models often use MoM to get a reasonable starting point before running EM
  • Latent variable models: when the likelihood is intractable, MoM (or its generalization, GMM) can bypass the likelihood entirely
  • Econometrics: GMM is the workhorse estimator when you have moment conditions but no full likelihood specification

MoM is also the simplest estimator to derive and understand. It builds intuition for what estimation is doing before you tackle the more sophisticated MLE theory.

Mental Model

You know that the population mean is $\mathbb{E}[X] = g_1(\theta)$, the population variance is $\text{Var}(X) = g_2(\theta)$, and so on. You compute the sample mean, sample variance, etc. Then you solve the equations that set sample moments equal to population moments. The solutions are your parameter estimates.

It is algebra, not optimization.

Formal Setup and Notation

Let $X_1, \ldots, X_n$ be i.i.d. from a distribution $P_\theta$ with parameter $\theta \in \mathbb{R}^d$. Define:

Definition

Population Moments

The $k$-th population moment is:

$$\mu_k(\theta) = \mathbb{E}_\theta[X^k]$$

More generally, a population moment can be any known function of $\theta$: $\mathbb{E}_\theta[g(X)] = m(\theta)$ for some function $g$.

Definition

Sample Moments

The $k$-th sample moment is:

$$\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^n X_i^k$$

By the law of large numbers, $\hat{\mu}_k \xrightarrow{P} \mu_k(\theta)$ as $n \to \infty$.
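The convergence of sample moments is easy to see numerically. A minimal sketch (using an exponential distribution as the illustrating example, which is an assumption on my part, not from the text above):

```python
import numpy as np

# Exponential with rate 2: E[X] = 1/2 and E[X^2] = 2/rate^2 = 0.5
rng = np.random.default_rng(0)
rate = 2.0
pop_m1, pop_m2 = 1 / rate, 2 / rate**2

for n in (100, 10_000, 1_000_000):
    x = rng.exponential(scale=1 / rate, size=n)
    m1_hat = np.mean(x)       # first sample moment
    m2_hat = np.mean(x**2)    # second sample moment
    print(n, abs(m1_hat - pop_m1), abs(m2_hat - pop_m2))
```

The absolute errors shrink as $n$ grows, as the law of large numbers predicts.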

Definition

Method of Moments Estimator

If $\theta \in \mathbb{R}^d$, the method of moments estimator $\hat{\theta}_{\text{MoM}}$ solves the system of $d$ equations:

$$\hat{\mu}_k = \mu_k(\hat{\theta}_{\text{MoM}}) \quad \text{for } k = 1, \ldots, d$$

That is, set the first $d$ sample moments equal to the corresponding population moments and solve for $\theta$.
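For a one-parameter model this is a single equation in a single unknown. A sketch, using an exponential distribution with rate $\lambda$ as an assumed example: since $\mathbb{E}[X] = 1/\lambda$, matching the first moment gives $\hat{\lambda}_{\text{MoM}} = 1/\bar{X}$.

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 3.0
x = rng.exponential(scale=1 / true_rate, size=50_000)

# Moment equation: sample mean = 1 / lambda  =>  lambda_hat = 1 / sample mean
lambda_hat = 1 / np.mean(x)
print(lambda_hat)  # close to the true rate of 3.0
```

No optimization is involved: the moment equation is inverted in closed form.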

Main Theorems

Theorem

Consistency of Method of Moments

Statement

If the function $m: \theta \mapsto (\mu_1(\theta), \ldots, \mu_d(\theta))$ is continuous and has a continuous inverse $m^{-1}$ in a neighborhood of $\theta^*$, then the method of moments estimator is consistent:

$$\hat{\theta}_{\text{MoM}} \xrightarrow{P} \theta^*$$

Intuition

The sample moments converge to the population moments by the law of large numbers. If the map from parameters to moments is invertible and continuous, then inverting converging inputs gives converging outputs. This is just the continuous mapping theorem applied to moment matching.

Proof Sketch

By the law of large numbers, $\hat{\mu}_k \xrightarrow{P} \mu_k(\theta^*)$ for each $k = 1, \ldots, d$. The estimator is $\hat{\theta} = m^{-1}(\hat{\mu}_1, \ldots, \hat{\mu}_d)$. Since $m^{-1}$ is continuous, the continuous mapping theorem gives $\hat{\theta} \xrightarrow{P} m^{-1}(\mu_1(\theta^*), \ldots, \mu_d(\theta^*)) = \theta^*$.

Why It Matters

MoM consistency is free: it only requires the law of large numbers and a smooth relationship between parameters and moments. You do not need to verify regularity conditions on the likelihood. This makes MoM applicable to models where MLE is difficult or undefined.

Failure Mode

If the mapping $m$ is not invertible (different parameters give the same moments), the MoM estimator is not identified. This happens in some mixture models where higher-order moments are needed for identification. Also, MoM can produce estimates outside the parameter space (e.g., a negative variance estimate).

Asymptotic Distribution

The MoM estimator is asymptotically normal. By the CLT, the vector of sample moments is asymptotically Gaussian. The delta method then gives:

$$\sqrt{n}(\hat{\theta}_{\text{MoM}} - \theta^*) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\text{MoM}})$$

where $\Sigma_{\text{MoM}} = (Dm^{-1}) \, \text{Cov}(g(X)) \, (Dm^{-1})^\top$ and $Dm^{-1}$ is the Jacobian of the inverse moment map.

In general, $\Sigma_{\text{MoM}} \neq I(\theta)^{-1}$. The MoM estimator is typically less efficient than MLE: it has larger asymptotic variance. Under standard regularity conditions the MLE asymptotically achieves the Cramér-Rao bound; MoM usually does not.

Generalized Method of Moments (GMM)

Definition

Generalized Method of Moments

When you have more moment conditions than parameters ($p > d$), the system is over-determined. GMM finds the parameter that minimizes a weighted norm of the moment violations:

$$\hat{\theta}_{\text{GMM}} = \arg\min_\theta \left(\frac{1}{n}\sum_{i=1}^n g(X_i, \theta)\right)^\top W \left(\frac{1}{n}\sum_{i=1}^n g(X_i, \theta)\right)$$

where $g(X, \theta) \in \mathbb{R}^p$ is a vector of moment conditions satisfying $\mathbb{E}[g(X, \theta^*)] = 0$, and $W$ is a positive definite weight matrix.

The optimal weight matrix is $W = \text{Cov}(g(X, \theta^*))^{-1}$, which minimizes the asymptotic variance of $\hat{\theta}_{\text{GMM}}$. In practice, it is estimated in a two-step procedure: first estimate $\theta$ with $W = I$, then re-estimate with the estimated optimal $W$.
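The two-step procedure can be sketched in a few lines. The example below is my own illustration (not from the text): a Poisson rate $\lambda$ estimated from two moment conditions, $\mathbb{E}[X - \lambda] = 0$ and $\mathbb{E}[X^2 - \lambda - \lambda^2] = 0$, so $p = 2 > d = 1$ and the system is over-identified.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=20_000)

def gbar(lam):
    """Sample average of the two moment conditions at lam."""
    return np.array([np.mean(x) - lam, np.mean(x**2) - lam - lam**2])

def objective(lam, W):
    g = gbar(lam)
    return g @ W @ g

# Step 1: identity weight matrix
step1 = minimize_scalar(objective, args=(np.eye(2),), bounds=(0.1, 20), method="bounded")
lam1 = step1.x

# Step 2: re-weight by the inverse covariance of the moment conditions,
# estimated at the step-1 parameter value
G = np.column_stack([x - lam1, x**2 - lam1 - lam1**2])
W_opt = np.linalg.inv(np.cov(G.T))
step2 = minimize_scalar(objective, args=(W_opt,), bounds=(0.1, 20), method="bounded")
print(step2.x)  # close to the true rate of 4.0
```

The bounds on $\lambda$ here are an arbitrary search interval for the scalar optimizer, not part of the GMM theory.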

When MoM Is Preferred Over MLE

MoM is not just a historical curiosity. There are real situations where it is the better choice:

  1. Closed-form solutions: MoM often gives explicit formulas where MLE requires numerical optimization (e.g., fitting a Gamma distribution)
  2. Computational simplicity: for initial parameter estimates or large-scale problems where likelihood evaluation is expensive
  3. Robustness to misspecification: MoM only requires certain moment conditions to hold, not the full distributional model. If the model is wrong but the moments are right, MoM remains consistent
  4. Intractable likelihoods: in latent variable models, the likelihood may involve intractable integrals, but moments may be computable
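Point 1 can be made concrete with the Gamma example it mentions. A sketch under the standard shape/rate parametrization (the moment-matching formulas below follow directly from the mean and variance expressions, but the code is my own illustration):

```python
import numpy as np

# Gamma(shape=alpha, rate=beta): E[X] = alpha/beta, Var(X) = alpha/beta^2.
# Matching sample mean m and sample variance v gives closed forms:
#   alpha_hat = m^2 / v,  beta_hat = m / v
rng = np.random.default_rng(7)
alpha_true, beta_true = 2.5, 1.5
x = rng.gamma(shape=alpha_true, scale=1 / beta_true, size=100_000)

m, v = np.mean(x), np.var(x)   # np.var divides by n, matching the MoM convention
alpha_hat = m**2 / v
beta_hat = m / v
print(alpha_hat, beta_hat)
```

The Gamma MLE, by contrast, has no closed form in the shape parameter and requires numerical root-finding.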

Canonical Examples

Example

MoM for Gaussian parameters

For $X_i \sim \mathcal{N}(\mu, \sigma^2)$, the first two population moments are $\mu_1 = \mu$ and $\mu_2 = \mu^2 + \sigma^2$.

Setting sample moments equal: $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2$.

The MoM estimator of the mean is the sample mean (same as MLE). The MoM estimator of variance divides by $n$ (same as MLE, but biased).
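A quick numerical check of the identity above (a sketch with arbitrary assumed parameters): the second-moment equation yields exactly the $1/n$ sample variance.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_hat = np.mean(x)
sigma2_hat = np.mean(x**2) - mu_hat**2   # solve the second moment equation

# Identical to the 1/n variance, which np.var computes by default (ddof=0)
print(np.allclose(sigma2_hat, np.var(x)))
```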

Example

MoM for Uniform(0, theta)

For $X_i \sim \text{Uniform}(0, \theta)$, $\mathbb{E}[X] = \theta/2$. Setting $\bar{X} = \hat{\theta}/2$ gives $\hat{\theta}_{\text{MoM}} = 2\bar{X}$.

Compare with MLE: $\hat{\theta}_{\text{MLE}} = X_{(n)} = \max_i X_i$. The MLE is more efficient: its error converges at rate $O(1/n)$ vs $O(1/\sqrt{n})$ for MoM. This is an example where MoM is much worse than MLE.
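The rate gap is easy to see by Monte Carlo. A sketch (sample size and repetition count chosen arbitrarily for illustration):

```python
import numpy as np

# Compare MoM (2 * sample mean) and MLE (sample max) for Uniform(0, theta)
rng = np.random.default_rng(0)
theta, n, reps = 1.0, 500, 2_000

x = rng.uniform(0, theta, size=(reps, n))
mom_err = np.abs(2 * x.mean(axis=1) - theta)   # |2*mean - theta| per replication
mle_err = np.abs(x.max(axis=1) - theta)        # |max - theta| per replication

print(mom_err.mean(), mle_err.mean())
```

At this sample size the average MLE error is roughly an order of magnitude smaller, reflecting the $O(1/n)$ vs $O(1/\sqrt{n})$ rates.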

Common Confusions

Watch Out

MoM is not always less efficient than MLE

For exponential families, the MLE and MoM often coincide (both use the sufficient statistic). The efficiency gap only appears for models where sufficient statistics are higher-dimensional than the parameter or where the moment map is far from optimal. For the Gaussian mean, MoM = MLE.

Watch Out

MoM can give impossible estimates

Nothing prevents MoM from returning a negative variance estimate or a probability outside $[0,1]$. Unlike MLE (which respects the model structure), MoM inverts algebraic equations without constraints. In practice, you clip or project the estimate onto the valid parameter space.
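One hypothetical setting where this bites (my own illustration, not from the text): observations $X = S + \varepsilon$ with known noise variance, where the MoM estimate of $\text{Var}(S)$ is the sample variance minus the noise variance. In small samples that difference can come out negative, and the usual fix is to project onto $[0, \infty)$.

```python
import numpy as np

rng = np.random.default_rng(5)
noise_var = 1.0      # assumed known
signal_var = 0.05    # small true signal variance
n = 30               # small sample, so the raw estimate is noisy

s = rng.normal(0, np.sqrt(signal_var), size=n)
x = s + rng.normal(0, np.sqrt(noise_var), size=n)

raw = np.var(x) - noise_var    # MoM estimate of Var(S); can be negative
clipped = max(raw, 0.0)        # project onto the valid parameter space
print(raw, clipped)
```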

Summary

  • MoM: set sample moments $\hat{\mu}_k$ equal to population moments $\mu_k(\theta)$ and solve for $\theta$
  • Consistency follows from the law of large numbers plus the continuous mapping theorem
  • MoM is asymptotically normal but typically less efficient than MLE
  • GMM handles over-identified models (more moment conditions than parameters) by minimizing a weighted quadratic form
  • MoM is preferred when: closed-form solutions exist, likelihoods are intractable, or robustness to misspecification is needed
  • MoM can produce out-of-range estimates; MLE respects model constraints

Exercises

ExerciseCore

Problem

Compute the method of moments estimator for the Poisson distribution: $X_i \sim \text{Poisson}(\lambda)$. Compare it with the MLE.

ExerciseAdvanced

Problem

For the Gamma distribution with shape $\alpha$ and rate $\beta$ (so $\mathbb{E}[X] = \alpha/\beta$ and $\text{Var}(X) = \alpha/\beta^2$), derive the MoM estimators for $\alpha$ and $\beta$. Explain why MoM yields closed-form estimators here while the MLE for $\alpha$ requires numerical optimization.

References

Canonical:

  • Casella & Berger, Statistical Inference (2nd ed., 2002), Chapter 7
  • van der Vaart, Asymptotic Statistics (1998), Chapter 5
  • Hansen, "Large Sample Properties of Generalized Method of Moments Estimators" (1982)

Current:

  • Wasserman, All of Statistics (2004), Chapter 9
  • Hall, Generalized Method of Moments (2005)
  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6


Last reviewed: April 2026
