
Foundations

Normal Distribution

The Normal distribution as a parametric family: density, moment generating function, closure under affine transformations and sums, MLE for mean and variance, Fisher information, and the bridge to the Chi-squared, Student-t, and F sampling distributions.

Essential · Core · Tier 1 · Stable · Core spine · ~60 min
For: ML, Stats, Actuarial, General

Why This Matters

The Normal distribution is the most common parametric model in statistics, and most of its dominance comes from one fact: linear functions of independent Normal random variables stay Normal. That closure property is what makes the central limit theorem usable and what makes the sample mean of any Normal sample tractable. The classical sampling distributions, Chi-squared, Student-t, and F, are all built by combining independent Normals through squaring, root scaling, and ratios. The Normal is also the maximum-entropy distribution on $\mathbb{R}$ subject to a fixed mean and variance, which is why it appears whenever you constrain only the first two moments and stop there.

Knowing the Normal well means knowing five facts cold: its density, its MGF, its closure under affine maps and independent sums, its MLE, and the independence of the sample mean and the sample variance for an i.i.d. Normal sample. The remainder of this page derives those five and connects each to a downstream test or model.

Definition

Definition

Normal Distribution

A random variable $X$ has a Normal distribution with mean $\mu\in\mathbb{R}$ and variance $\sigma^2>0$ if its density is

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad x\in\mathbb{R}.$$

The case $\mu=0$, $\sigma^2=1$ is the standard Normal, written $Z\sim\mathcal{N}(0,1)$. The standard Normal CDF is denoted $\Phi$ and its density $\varphi$.

A Normal random variable is supported on all of $\mathbb{R}$ and has positive density everywhere; no value is impossible, although values more than four or five standard deviations from the mean have very small probability. The density is symmetric about $\mu$, attains its maximum there, and falls off at a Gaussian rate, faster than any polynomial and indeed faster than any exponential $e^{-c|x|}$.
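As a quick sanity check, not part of the text above, the Python sketch below evaluates the density formula directly and compares it with `scipy.stats.norm.pdf`; the parameter values are arbitrary illustrations, and the tail probabilities confirm that mass beyond four or five standard deviations is tiny but nonzero.

```python
# Minimal check of the density formula against scipy's implementation.
# mu and sigma are illustrative values, not taken from the text.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0
x = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 7)

# Density written exactly as in the definition above.
pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)
assert np.allclose(pdf_manual, pdf_scipy)

# Two-sided tail mass beyond k standard deviations: small but strictly positive.
for k in (4, 5):
    print(k, 2 * norm.sf(k))
```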

Density Normalizes to One

Proposition

Normal Density Integrates to One

Statement

$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)dx = 1.$$

Intuition

After centering and scaling, the integrand becomes $e^{-z^2/2}/\sqrt{2\pi}$. The classical trick squares the integral and computes the resulting double integral in polar coordinates.

Proof Sketch

Let $I = \int_{-\infty}^{\infty} e^{-z^2/2}\,dz$. Then $$I^2 = \int_{\mathbb{R}^2} e^{-(z_1^2+z_2^2)/2}\,dz_1\,dz_2 = \int_0^{2\pi}\int_0^\infty e^{-r^2/2}\,r\,dr\,d\theta = 2\pi.$$ So $I = \sqrt{2\pi}$. Substituting $z = (x-\mu)/\sigma$ in the original integral gives $\sqrt{2\pi\sigma^2}$, which cancels the prefactor.
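A numerical counterpart of the proposition, under one illustrative choice of $(\mu, \sigma^2)$: integrating the density over all of $\mathbb{R}$ with `scipy.integrate.quad` should return a value indistinguishable from 1.

```python
# Numerical confirmation of the normalization constant for one illustrative (mu, sigma^2).
import numpy as np
from scipy.integrate import quad

mu, sigma2 = -0.7, 3.2

def normal_pdf(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

total, abserr = quad(normal_pdf, -np.inf, np.inf)
print(total, abserr)  # total ~ 1.0, within quad's reported error estimate
```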

Why It Matters

The factor $\sqrt{2\pi\sigma^2}$ is not optional. Drop it and the density does not normalize, every downstream probability and expectation is wrong by a constant, and the log-likelihood that drives MLE for the Normal is off by the same constant. The constant matters for likelihood ratios across $\sigma^2$ values but cancels for likelihood ratios at a fixed $\sigma^2$.

Failure Mode

The polar-coordinate trick uses the rotational symmetry of $e^{-(z_1^2+z_2^2)/2}$. No analogous trick works for non-Gaussian densities. Substituting $z = (x-\mu)/\sigma$ requires $\sigma>0$; the degenerate $\sigma = 0$ case is a point mass at $\mu$, not a Normal.

Moments and MGF

Theorem

Normal Moment Generating Function

Statement

For $X\sim\mathcal{N}(\mu,\sigma^2)$ and every $s\in\mathbb{R}$, $$M_X(s) = \mathbb{E}[e^{sX}] = \exp\!\left(\mu s + \frac{\sigma^2 s^2}{2}\right).$$

Intuition

The log-MGF is a quadratic in $s$, with linear coefficient $\mu$ and quadratic coefficient $\sigma^2/2$. A quadratic log-MGF is the defining property of the Normal family: any random variable whose log-MGF is a quadratic polynomial is Normal.

Proof Sketch

Write $X = \mu + \sigma Z$ for $Z\sim\mathcal{N}(0,1)$. Then $e^{sX} = e^{s\mu}e^{s\sigma Z}$, so $$M_X(s) = e^{s\mu}\int_{-\infty}^{\infty}e^{s\sigma z}\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz.$$ Complete the square in the exponent: $s\sigma z - z^2/2 = -(z - s\sigma)^2/2 + s^2\sigma^2/2$. Shift the integration variable. The remaining Gaussian integral evaluates to $\sqrt{2\pi}$, cancelling the normalizing constant and leaving $e^{s\mu+s^2\sigma^2/2}$.
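A Monte Carlo sketch of the theorem, with arbitrary parameter values and a fixed seed: the sample average of $e^{sX}$ should sit close to $\exp(\mu s + \sigma^2 s^2/2)$ for moderate $s$.

```python
# Monte Carlo check of M_X(s) = exp(mu*s + sigma^2 * s^2 / 2).
# mu, sigma, the grid of s values, and the sample size are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.2
x = rng.normal(mu, sigma, size=1_000_000)

for s in (-1.0, 0.5, 1.0):
    empirical = np.mean(np.exp(s * x))
    exact = np.exp(mu * s + sigma ** 2 * s ** 2 / 2)
    print(s, empirical, exact)  # the two columns should agree to a few decimals
```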

Why It Matters

The mean and variance read directly off the MGF: $\mathbb{E}[X] = M_X'(0) = \mu$ and $\operatorname{Var}(X) = M_X''(0) - M_X'(0)^2 = \sigma^2$. The MGF is finite for every real $s$, which is stronger than having all polynomial moments finite; it forces sub-Gaussian tails. See sub-Gaussian random variables for the corresponding tail bound.

Failure Mode

The MGF is the exponential of a quadratic only for the Normal. If you compute the empirical MGF of a sample and find an approximately quadratic log-MGF you have evidence for Normality, but a finite-sample MGF estimate is noisy. The MGF is a tool for identifying distributions from algebraic forms, not for goodness-of-fit testing; use goodness-of-fit tests for that.

Closure Under Affine Maps and Independent Sums

Theorem

Affine and Sum Closure

Statement

  1. Affine. If $X\sim\mathcal{N}(\mu,\sigma^2)$ and $a,b\in\mathbb{R}$ with $a\ne 0$, then $aX + b \sim \mathcal{N}(a\mu + b,\, a^2\sigma^2)$.
  2. Sum of independents. If $X\sim\mathcal{N}(\mu_X,\sigma_X^2)$ and $Y\sim\mathcal{N}(\mu_Y,\sigma_Y^2)$ are independent, then $X + Y \sim \mathcal{N}(\mu_X+\mu_Y,\, \sigma_X^2+\sigma_Y^2)$.

Intuition

Closure under affine maps is just a change of location and scale on the density and follows from the change-of-variables formula. Closure under independent sums is the MGF argument: by independence the MGFs multiply, and the product of two Normal MGFs is itself a Normal MGF.

Proof Sketch

For the affine claim, compute $M_{aX+b}(s) = e^{bs}M_X(as) = e^{bs}e^{\mu as + \sigma^2 a^2 s^2/2} = e^{(a\mu+b)s + a^2\sigma^2 s^2/2}$, the MGF of $\mathcal{N}(a\mu+b,\, a^2\sigma^2)$. MGF uniqueness identifies the law. For the sum claim, independence gives $M_{X+Y}(s) = M_X(s)M_Y(s) = e^{(\mu_X+\mu_Y)s + (\sigma_X^2+\sigma_Y^2)s^2/2}$, the MGF of $\mathcal{N}(\mu_X+\mu_Y,\, \sigma_X^2+\sigma_Y^2)$.
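A simulation sketch of the sum claim, under made-up parameters: draw independent Normals, add them, and run a Kolmogorov-Smirnov test against the claimed $\mathcal{N}(\mu_X+\mu_Y,\ \sigma_X^2+\sigma_Y^2)$ law.

```python
# Simulation check of sum closure via a KS test; all parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_x, sd_x = 2.0, 1.0
mu_y, sd_y = -1.0, 3.0

x = rng.normal(mu_x, sd_x, size=100_000)
y = rng.normal(mu_y, sd_y, size=100_000)
total = x + y

sd_sum = np.sqrt(sd_x ** 2 + sd_y ** 2)
ks = stats.kstest(total, "norm", args=(mu_x + mu_y, sd_sum))
print(ks.statistic, ks.pvalue)  # large p-value: consistent with the claimed Normal law
```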

Why It Matters

Closure is what makes the sample mean of an i.i.d. Normal sample explicit: $\bar X_n = (1/n)\sum X_i \sim \mathcal{N}(\mu,\sigma^2/n)$, with no asymptotic approximation needed. Closure also drives the central limit theorem heuristic: averages of independent random variables look approximately Normal even when the summands are not Normal, because the Normal family is closed under the operation that defines averaging.

Failure Mode

Independence in the sum result is essential. $X + X = 2X \sim \mathcal{N}(2\mu, 4\sigma^2)$, not $\mathcal{N}(2\mu, 2\sigma^2)$. Sums of dependent Normals are still Normal if the joint law is multivariate Normal, but the variance is $\sigma_X^2 + \sigma_Y^2 + 2\operatorname{Cov}(X,Y)$, not the independent-sum variance.

The affine and sum results combine to give the stronger statement that every linear combination of jointly Normal random variables is Normal. This is the defining property of the multivariate Normal distribution.

Maximum Likelihood Estimation

Theorem

MLE for Mean and Variance

Statement

Given an i.i.d. sample $X_1,\dots,X_n$ from $\mathcal{N}(\mu,\sigma^2)$, the maximum likelihood estimators are $$\hat\mu = \bar X_n = \frac{1}{n}\sum_{i=1}^n X_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2.$$

Intuition

The log-likelihood is quadratic, hence concave, in $\mu$ for fixed $\sigma^2$, and the profile log-likelihood in $\sigma^2$ has a unique interior maximum, so the score equations have a single explicit solution. The MLE for $\mu$ is the sample mean; the MLE for $\sigma^2$ divides the sum of squared deviations by $n$, not $n-1$.

Proof Sketch

The log-likelihood is $$\ell(\mu,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2.$$ Differentiating with respect to $\mu$ and setting to zero gives $\sum(X_i-\mu) = 0$, so $\hat\mu = \bar X_n$. Substituting back, the profile log-likelihood in $\sigma^2$ is $-(n/2)\log\sigma^2 - (1/(2\sigma^2))\sum(X_i-\bar X_n)^2$ plus a constant, which is maximized at $\hat\sigma^2 = (1/n)\sum(X_i-\bar X_n)^2$.
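The closed forms are easy to cross-check against NumPy, which exposes both divisors through the `ddof` argument; the sample below is an arbitrary illustration.

```python
# The closed-form MLEs next to numpy's estimators:
# ddof=0 divides by n (the MLE), ddof=1 divides by n-1 (the unbiased S^2).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=50)  # illustrative sample
n = x.size

mu_hat = x.mean()
sigma2_mle = np.sum((x - mu_hat) ** 2) / n
s2_unbiased = np.sum((x - mu_hat) ** 2) / (n - 1)

assert np.isclose(sigma2_mle, x.var(ddof=0))
assert np.isclose(s2_unbiased, x.var(ddof=1))
print(mu_hat, sigma2_mle, s2_unbiased)
```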

Why It Matters

The unbiased estimator of $\sigma^2$ is $S^2 = (1/(n-1))\sum(X_i-\bar X_n)^2$, which differs from $\hat\sigma^2$ by the factor $n/(n-1)$. Both are consistent. The MLE is biased downward: $\mathbb{E}[\hat\sigma^2] = \frac{n-1}{n}\sigma^2$, and the bias vanishes as $n\to\infty$. This is the simplest example of a finite-sample bias of an MLE that disappears asymptotically.
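A short Monte Carlo illustration of that bias, with made-up values of $\mu$, $\sigma^2$, and $n$: averaging the MLE over many replications lands near $(n-1)\sigma^2/n$, while the average of $S^2$ lands near $\sigma^2$.

```python
# Finite-sample bias of the variance MLE; all values are illustrative.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.0, 4.0, 5, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_mle = samples.var(axis=1, ddof=0)  # divide by n
s2 = samples.var(axis=1, ddof=1)          # divide by n-1

print(sigma2_mle.mean(), (n - 1) / n * sigma2)  # both close to 3.2
print(s2.mean(), sigma2)                        # both close to 4.0
```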

Failure Mode

The MLE requires $n\ge 2$ to estimate $\sigma^2$ meaningfully; for $n=1$ the estimator $\hat\sigma^2 = 0$ is degenerate. The formula assumes the sample is i.i.d.; correlated Normal samples require the joint Normal log-likelihood and a covariance estimator.

Fisher Information

The Fisher information matrix for $(\mu,\sigma^2)$ in the Normal model, per observation, is

$$I(\mu,\sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$

The off-diagonal entry is zero, so the MLEs $\hat\mu$ and $\hat\sigma^2$ are asymptotically uncorrelated; in finite samples they are actually independent under Normality, which is a stronger statement (see the sample-mean-and-variance-independence theorem below). The Cramer-Rao lower bound on the variance of any unbiased estimator of $\mu$ is therefore $\sigma^2/n$, achieved by $\bar X_n$. See Fisher information for the general computation.
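The matrix can be checked by simulation, since the Fisher information equals the variance of the score; the sketch below uses the analytic score formulas for the Normal and illustrative parameter values.

```python
# Monte Carlo check of the per-observation Fisher information via the score variance.
# mu, sigma2, and the sample size are illustrative.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 1.0, 2.5
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

score_mu = (x - mu) / sigma2
score_sigma2 = -1.0 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2 ** 2)

print(np.var(score_mu), 1 / sigma2)                  # both ~0.4
print(np.var(score_sigma2), 1 / (2 * sigma2 ** 2))   # both ~0.08
print(np.cov(score_mu, score_sigma2)[0, 1])          # ~0: the off-diagonal entry
```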

Sample Mean and Sample Variance Are Independent

Theorem

Independence of Sample Mean and Sample Variance

Statement

Let $X_1,\dots,X_n$ be i.i.d. $\mathcal{N}(\mu,\sigma^2)$, $\bar X_n = (1/n)\sum X_i$, and $S^2 = (1/(n-1))\sum (X_i-\bar X_n)^2$. Then

  1. $\bar X_n \sim \mathcal{N}(\mu, \sigma^2/n)$.
  2. $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$.
  3. $\bar X_n$ and $S^2$ are independent.

Intuition

Decompose the sample vector $(X_1,\dots,X_n)$ into its projection onto the all-ones direction (which carries $\bar X_n$) and the orthogonal complement (which carries the deviations $X_i-\bar X_n$). For i.i.d. Normal data the two projections are independent Normals, and the squared norm of an orthogonal Normal projection is a Chi-squared.

Proof Sketch

Write $\mathbf X = (X_1,\dots,X_n)^\top$. Let $U$ be an orthogonal matrix with first row $(1/\sqrt n,\dots,1/\sqrt n)$. Set $\mathbf Y = U(\mathbf X - \mu\mathbf 1)/\sigma$; then $\mathbf Y$ is a standard Normal vector in $\mathbb{R}^n$ because orthogonal transformations preserve the standard Normal distribution. The first coordinate $Y_1 = \sqrt n(\bar X_n-\mu)/\sigma$ and the remaining $Y_2,\dots,Y_n$ are independent. The sum of squared deviations equals $\sigma^2\sum_{i=2}^n Y_i^2$, hence $(n-1)S^2/\sigma^2 = \sum_{i=2}^n Y_i^2 \sim \chi^2_{n-1}$, and this is independent of $Y_1$, hence of $\bar X_n$. The full argument is Cochran's theorem.
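All three claims show up cleanly in simulation; the sketch below uses arbitrary parameter values, checks the two marginal laws with KS tests, and checks the independence claim through the sample correlation of $\bar X_n$ and $S^2$ (zero correlation is necessary rather than sufficient, but it is the quick diagnostic).

```python
# Simulation of the three claims for illustrative mu, sigma, n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma, n, reps = 2.0, 1.5, 8, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print(stats.kstest(xbar, "norm", args=(mu, sigma / np.sqrt(n))).pvalue)        # claim 1
print(stats.kstest((n - 1) * s2 / sigma ** 2, "chi2", args=(n - 1,)).pvalue)   # claim 2
print(np.corrcoef(xbar, s2)[0, 1])                                             # claim 3: ~0
```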

Why It Matters

This three-part statement is the engine behind almost every classical inference for Normal samples. It identifies the law of the sample mean, the law of the sample variance (as a scaled Chi-squared), and their independence, which is what makes the t-statistic $(\bar X_n - \mu)/(S/\sqrt n)$ a Student-t random variable rather than an arbitrary ratio of dependent random variables. See Student-t distribution and t-test for the consequence.

Failure Mode

The independence of sample mean and sample variance is a special property of the Normal distribution. For non-Normal i.i.d. samples the sample mean and sample variance are not independent in finite samples; they only become asymptotically uncorrelated. Using the Student-t exact distribution outside of Normal samples replaces an exact statement with an approximation whose accuracy depends on tail weight and sample size.

Where the Normal Appears Downstream

The Normal feeds the classical sampling distributions. The connections derived in the distributions atlas instantiate here (a simulation sketch follows the list):

  • $Z^2\sim\chi^2_1$ and $\sum_{i=1}^k Z_i^2\sim\chi^2_k$ for independent standard Normals. See chi-squared distribution and tests.
  • $Z/\sqrt{V/k}\sim t_k$ for $Z\sim\mathcal{N}(0,1)$ and $V\sim\chi^2_k$ independent. See Student-t distribution and t-test.
  • The MLE $\hat\theta$ of any regular parametric model satisfies, asymptotically, $\sqrt n(\hat\theta - \theta_0)\to\mathcal{N}(0, I(\theta_0)^{-1})$. See maximum likelihood estimation.
  • The central limit theorem says that the standardized sample mean of any i.i.d. sample with finite variance converges in distribution to a Normal. See central limit theorem.
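The first two bridges can be simulated directly; the sketch below draws the ingredients with NumPy and tests the claimed laws with SciPy. Sample sizes and degrees of freedom are arbitrary illustrations.

```python
# Simulations of two bridges: Z^2 ~ chi2_1 and Z / sqrt(V/k) ~ t_k.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
m, k = 200_000, 7  # replications and degrees of freedom, chosen for illustration

z = rng.standard_normal(m)
print(stats.kstest(z ** 2, "chi2", args=(1,)).pvalue)   # Z^2 against chi-squared(1)

v = rng.chisquare(k, size=m)                            # independent of the fresh Z below
t = rng.standard_normal(m) / np.sqrt(v / k)
print(stats.kstest(t, "t", args=(k,)).pvalue)           # against Student-t with k d.o.f.
```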

The Normal is also the conjugate prior for the mean of a Normal likelihood with known variance; for unknown variance, the conjugate prior on the variance is an Inverse-Gamma (equivalently, a Gamma prior on the precision). See bayesian estimation for the posterior update.
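A minimal sketch of the known-variance conjugate update, with hypothetical prior parameters and data: the posterior precision is the sum of the prior precision and $n/\sigma^2$, and the posterior mean is the precision-weighted average of the prior mean and the sample total.

```python
# Conjugate Normal-Normal update for the mean with known variance.
# The prior (m0, tau0_sq), the known sigma2, and the data are hypothetical.
import numpy as np

sigma2 = 4.0                          # known observation variance
m0, tau0_sq = 0.0, 10.0               # prior mean and variance for mu
x = np.array([2.1, 3.3, 1.8, 2.7])    # hypothetical observations
n = x.size

post_precision = 1 / tau0_sq + n / sigma2
tau1_sq = 1 / post_precision
m1 = tau1_sq * (m0 / tau0_sq + x.sum() / sigma2)
print(m1, tau1_sq)                    # posterior mean and variance for mu
```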

Common Confusions

Watch Out

The MLE for variance divides by n, not n minus one

The MLE of $\sigma^2$ for an i.i.d. Normal sample is $\hat\sigma^2 = (1/n)\sum(X_i-\bar X_n)^2$. The unbiased estimator $S^2$ divides by $n-1$. They are different estimators with different finite-sample properties: the MLE is biased but has smaller mean squared error; $S^2$ is unbiased. Mislabeling one as the other shifts every reported standard error by a factor of $\sqrt{n/(n-1)}$.

Watch Out

The Normal MGF is finite everywhere, but that does not make the Normal lighter-tailed than every other distribution

Sub-Gaussian distributions have MGFs finite on all of $\mathbb{R}$. There are non-Normal sub-Gaussian distributions, for example any bounded random variable. Conversely, having all moments finite does not guarantee a finite MGF; the Lognormal distribution has every polynomial moment finite but no MGF on any neighborhood of zero, because $\mathbb{E}[e^{sX}] = \infty$ for every $s>0$.

Watch Out

Closure under sums needs independence, not just zero correlation

For jointly Normal $(X,Y)$, zero correlation implies independence, so closure of independent sums extends to uncorrelated sums in that joint setting. For non-Normal $(X,Y)$, zero correlation does not imply independence and the sum-closure result can fail. Always verify joint Normality before invoking sum closure from a correlation calculation.

Exercises

ExerciseCore

Problem

Let $X\sim\mathcal{N}(3, 4)$. Find $\mathbb{P}(1 \le X \le 5)$ in terms of $\Phi$ and evaluate numerically.

ExerciseCore

Problem

Let $X_1,X_2$ be independent $\mathcal{N}(0,1)$ random variables. Find the distribution of $Y = 3X_1 - 4X_2$.

ExerciseCore

Problem

Show that the MLE $\hat\sigma^2 = (1/n)\sum(X_i-\bar X_n)^2$ has expectation $(n-1)\sigma^2/n$ and is therefore biased downward.

ExerciseAdvanced

Problem

Let $X\sim\mathcal{N}(\mu,\sigma^2)$. Show that for every $\epsilon > 0$, $$\mathbb{P}(X - \mu \ge \epsilon) \le \exp\!\left(-\frac{\epsilon^2}{2\sigma^2}\right).$$

ExerciseAdvanced

Problem

Let $X_1,\dots,X_n$ be i.i.d. $\mathcal{N}(\mu,\sigma^2)$ with $\sigma^2$ known. Find the Fisher information for $\mu$ from a single observation and verify that the Cramer-Rao lower bound $\sigma^2/n$ is achieved by $\bar X_n$.

References

Canonical:

  • Casella and Berger, Statistical Inference (2002), Chapter 3 (Section 3.3 on the Normal family), Chapter 5 (Section 5.3 on sampling distributions for the Normal), and Chapter 7 (Section 7.2 on Normal MLE).
  • Lehmann and Casella, Theory of Point Estimation (1998), Chapter 1 (sufficiency for the Normal), Chapter 2 (UMVUE for $\mu$ and $\sigma^2$).
  • Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 1 (Sections 1.2 and 1.3).

Probability:

  • Blitzstein and Hwang, Introduction to Probability (2019), Chapter 5.
  • Durrett, Probability: Theory and Examples (2019), Chapter 3 (Section 3.4 on characteristic functions of the Normal).
  • Vershynin, High-Dimensional Probability (2018), Chapter 2 (sub-Gaussian properties of the Normal).

Foundational papers:

  • Gauss, Theoria Motus Corporum Coelestium (1809), the historical introduction of the density as the error law of least-squares regression.

Last reviewed: May 11, 2026
