
Foundations

Normal Distribution

The Normal distribution as a parametric family: density, moment generating function, closure under affine transformations and sums, MLE for mean and variance, Fisher information, and the bridge to the Chi-squared, Student-t, and F sampling distributions.

Essential · Core · Tier 1 · Stable · Core spine · ~60 min
For: ML, Stats, Actuarial, General

Why This Matters

The Normal distribution is the most common parametric model in statistics, and most of its dominance comes from one fact: linear functions of independent Normal random variables stay Normal. That closure property is what makes the central limit theorem usable and what makes the sample mean of any Normal sample tractable. The classical sampling distributions, Chi-squared, Student-t, and F, are all built by combining independent Normals through squaring, root scaling, and ratios. The Normal is also the maximum-entropy distribution on $\mathbb{R}$ subject to a fixed mean and variance, which is why it appears whenever you constrain only the first two moments and stop there.

Knowing the Normal well means knowing five facts cold: its density, its MGF, its closure under affine maps and independent sums, its MLE, and the independence of the sample mean and the sample variance for an i.i.d. Normal sample. The remainder of this page derives those five and connects each to a downstream test or model.

Definition

Definition

Normal Distribution

A random variable $X$ has a Normal distribution with mean $\mu\in\mathbb{R}$ and variance $\sigma^2>0$ if its density is

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad x\in\mathbb{R}.$$

The case $\mu=0$, $\sigma^2=1$ is the standard Normal, written $Z\sim\mathcal{N}(0,1)$. The standard Normal CDF is denoted $\Phi$ and its density $\varphi$.

A Normal random variable is supported on all of $\mathbb{R}$ and has positive density everywhere; no value is impossible, although values more than four or five standard deviations from the mean have very small probability. The density is symmetric about $\mu$, attains its maximum there, and falls off at a Gaussian rate, faster than any polynomial and indeed faster than any exponential $e^{-c|x|}$.
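As a quick sanity check, not part of the text above, the Python sketch below evaluates the density formula directly and compares it with `scipy.stats.norm.pdf`; the parameter values are arbitrary illustrations, and the tail probabilities confirm that mass beyond four or five standard deviations is tiny but nonzero.

```python
# Minimal check of the density formula against scipy's implementation.
# mu and sigma are illustrative values, not taken from the text.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0
x = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 7)

# Density written exactly as in the definition above.
pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)
assert np.allclose(pdf_manual, pdf_scipy)

# Two-sided tail mass beyond k standard deviations: small but strictly positive.
for k in (4, 5):
    print(k, 2 * norm.sf(k))
```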

Density Normalizes to One

Proposition

Normal Density Integrates to One

Statement

$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)dx = 1.$$

Intuition

After centering and scaling, the integrand becomes $e^{-z^2/2}/\sqrt{2\pi}$. The classical trick squares the integral and computes the resulting double integral in polar coordinates.

Proof Sketch

Let $I = \int_{-\infty}^{\infty} e^{-z^2/2}\,dz$. Then $$I^2 = \int_{\mathbb{R}^2} e^{-(z_1^2+z_2^2)/2}\,dz_1\,dz_2 = \int_0^{2\pi}\int_0^\infty e^{-r^2/2}\,r\,dr\,d\theta = 2\pi.$$ So $I = \sqrt{2\pi}$. Substituting $z = (x-\mu)/\sigma$ in the original integral gives $\sqrt{2\pi\sigma^2}$, which cancels the prefactor.
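A numerical counterpart of the proposition, under one illustrative choice of $(\mu, \sigma^2)$: integrating the density over all of $\mathbb{R}$ with `scipy.integrate.quad` should return a value indistinguishable from 1.

```python
# Numerical confirmation of the normalization constant for one illustrative (mu, sigma^2).
import numpy as np
from scipy.integrate import quad

mu, sigma2 = -0.7, 3.2

def normal_pdf(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

total, abserr = quad(normal_pdf, -np.inf, np.inf)
print(total, abserr)  # total ~ 1.0, within quad's reported error estimate
```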

Why It Matters

The factor $\sqrt{2\pi\sigma^2}$ is not optional. Drop it and the density does not normalize, every downstream probability and expectation is wrong by a constant, and the log-likelihood that drives MLE for the Normal is off by the same constant. The constant matters for likelihood ratios across $\sigma^2$ values but cancels for likelihood ratios at a fixed $\sigma^2$.

Failure Mode

The polar-coordinate trick uses the rotational symmetry of $e^{-(z_1^2+z_2^2)/2}$. No analogous trick works for non-Gaussian densities. Substituting $z = (x-\mu)/\sigma$ requires $\sigma>0$; the degenerate $\sigma = 0$ case is a point mass at $\mu$, not a Normal.

Moments and MGF

Theorem

Normal Moment Generating Function

Statement

For $X\sim\mathcal{N}(\mu,\sigma^2)$ and every $s\in\mathbb{R}$, $$M_X(s) = \mathbb{E}[e^{sX}] = \exp\!\left(\mu s + \frac{\sigma^2 s^2}{2}\right).$$

Intuition

The log-MGF is a quadratic in $s$, with linear coefficient $\mu$ and quadratic coefficient $\sigma^2/2$. A quadratic log-MGF is the defining property of the Normal family: any random variable whose log-MGF is a quadratic polynomial is Normal.

Proof Sketch

Write $X = \mu + \sigma Z$ for $Z\sim\mathcal{N}(0,1)$. Then $e^{sX} = e^{s\mu}e^{s\sigma Z}$, so $$M_X(s) = e^{s\mu}\int_{-\infty}^{\infty}e^{s\sigma z}\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz.$$ Complete the square in the exponent: $s\sigma z - z^2/2 = -(z - s\sigma)^2/2 + s^2\sigma^2/2$. Shift the integration variable. The remaining Gaussian integral evaluates to $\sqrt{2\pi}$, cancelling the normalizing constant and leaving $e^{s\mu+s^2\sigma^2/2}$.
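A Monte Carlo sketch of the theorem, with arbitrary parameter values and a fixed seed: the sample average of $e^{sX}$ should sit close to $\exp(\mu s + \sigma^2 s^2/2)$ for moderate $s$.

```python
# Monte Carlo check of M_X(s) = exp(mu*s + sigma^2 * s^2 / 2).
# mu, sigma, the grid of s values, and the sample size are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.2
x = rng.normal(mu, sigma, size=1_000_000)

for s in (-1.0, 0.5, 1.0):
    empirical = np.mean(np.exp(s * x))
    exact = np.exp(mu * s + sigma ** 2 * s ** 2 / 2)
    print(s, empirical, exact)  # the two columns should agree to a few decimals
```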

Why It Matters

The mean and variance read directly off the MGF: $\mathbb{E}[X] = M_X'(0) = \mu$ and $\operatorname{Var}(X) = M_X''(0) - M_X'(0)^2 = \sigma^2$. The MGF is finite for every real $s$, which is stronger than having all polynomial moments finite; it forces sub-Gaussian tails. See sub-Gaussian random variables for the corresponding tail bound.

Failure Mode

The MGF is the exponential of a quadratic only for the Normal. If you compute the empirical MGF of a sample and find an approximately quadratic log-MGF you have evidence for Normality, but a finite-sample MGF estimate is noisy. The MGF is a tool for identifying distributions from algebraic forms, not for goodness-of-fit testing; use goodness-of-fit tests for that.

Closure Under Affine Maps and Independent Sums

Theorem

Affine and Sum Closure

Statement

  1. Affine. If $X\sim\mathcal{N}(\mu,\sigma^2)$ and $a,b\in\mathbb{R}$ with $a\ne 0$, then $aX + b \sim \mathcal{N}(a\mu + b,\, a^2\sigma^2)$.
  2. Sum of independents. If $X\sim\mathcal{N}(\mu_X,\sigma_X^2)$ and $Y\sim\mathcal{N}(\mu_Y,\sigma_Y^2)$ are independent, then $X + Y \sim \mathcal{N}(\mu_X+\mu_Y,\, \sigma_X^2+\sigma_Y^2)$.

Intuition

Closure under affine maps is just a change of location and scale on the density and follows from the change-of-variables formula. Closure under independent sums is the MGF argument: by independence the MGFs multiply, and the product of two Normal MGFs is itself a Normal MGF.

Proof Sketch

For the affine claim, compute $M_{aX+b}(s) = e^{bs}M_X(as) = e^{bs}e^{\mu as + \sigma^2 a^2 s^2/2} = e^{(a\mu+b)s + a^2\sigma^2 s^2/2}$, the MGF of $\mathcal{N}(a\mu+b,\, a^2\sigma^2)$. MGF uniqueness identifies the law. For the sum claim, independence gives $M_{X+Y}(s) = M_X(s)M_Y(s) = e^{(\mu_X+\mu_Y)s + (\sigma_X^2+\sigma_Y^2)s^2/2}$, the MGF of $\mathcal{N}(\mu_X+\mu_Y,\, \sigma_X^2+\sigma_Y^2)$.
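A simulation sketch of the sum claim, under made-up parameters: draw independent Normals, add them, and run a Kolmogorov-Smirnov test against the claimed $\mathcal{N}(\mu_X+\mu_Y,\ \sigma_X^2+\sigma_Y^2)$ law.

```python
# Simulation check of sum closure via a KS test; all parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_x, sd_x = 2.0, 1.0
mu_y, sd_y = -1.0, 3.0

x = rng.normal(mu_x, sd_x, size=100_000)
y = rng.normal(mu_y, sd_y, size=100_000)
total = x + y

sd_sum = np.sqrt(sd_x ** 2 + sd_y ** 2)
ks = stats.kstest(total, "norm", args=(mu_x + mu_y, sd_sum))
print(ks.statistic, ks.pvalue)  # large p-value: consistent with the claimed Normal law
```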

Why It Matters

Closure is what makes the sample mean of an i.i.d. Normal sample explicit: $\bar X_n = (1/n)\sum X_i \sim \mathcal{N}(\mu,\sigma^2/n)$, with no asymptotic approximation needed. Closure also drives the central limit theorem heuristic: averages of independent random variables look approximately Normal even when the summands are not Normal, because the Normal family is closed under the operation that defines averaging.

Failure Mode

Independence in the sum result is essential. $X + X = 2X \sim \mathcal{N}(2\mu, 4\sigma^2)$, not $\mathcal{N}(2\mu, 2\sigma^2)$. Sums of dependent Normals are still Normal if the joint law is multivariate Normal, but the variance is $\sigma_X^2 + \sigma_Y^2 + 2\operatorname{Cov}(X,Y)$, not the independent-sum variance.

The affine and sum results combine to give the stronger statement that every linear combination of jointly Normal random variables is Normal. This is the defining property of the multivariate Normal distribution.

Maximum Likelihood Estimation

Theorem

MLE for Mean and Variance

Statement

Given an i.i.d. sample $X_1,\dots,X_n$ from $\mathcal{N}(\mu,\sigma^2)$, the maximum likelihood estimators are $$\hat\mu = \bar X_n = \frac{1}{n}\sum_{i=1}^n X_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2.$$

Intuition

The log-likelihood is quadratic, hence concave, in $\mu$ for fixed $\sigma^2$, and the profile log-likelihood in $\sigma^2$ has a unique interior maximum, so the score equations have a single explicit solution. The MLE for $\mu$ is the sample mean; the MLE for $\sigma^2$ divides the sum of squared deviations by $n$, not $n-1$.

Proof Sketch

The log-likelihood is $$\ell(\mu,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2.$$ Differentiating with respect to $\mu$ and setting to zero gives $\sum(X_i-\mu) = 0$, so $\hat\mu = \bar X_n$. Substituting back, the profile log-likelihood in $\sigma^2$ is $-(n/2)\log\sigma^2 - (1/(2\sigma^2))\sum(X_i-\bar X_n)^2$ plus a constant, which is maximized at $\hat\sigma^2 = (1/n)\sum(X_i-\bar X_n)^2$.
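The closed forms are easy to cross-check against NumPy, which exposes both divisors through the `ddof` argument; the sample below is an arbitrary illustration.

```python
# The closed-form MLEs next to numpy's estimators:
# ddof=0 divides by n (the MLE), ddof=1 divides by n-1 (the unbiased S^2).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=50)  # illustrative sample
n = x.size

mu_hat = x.mean()
sigma2_mle = np.sum((x - mu_hat) ** 2) / n
s2_unbiased = np.sum((x - mu_hat) ** 2) / (n - 1)

assert np.isclose(sigma2_mle, x.var(ddof=0))
assert np.isclose(s2_unbiased, x.var(ddof=1))
print(mu_hat, sigma2_mle, s2_unbiased)
```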

Why It Matters

The unbiased estimator of $\sigma^2$ is $S^2 = (1/(n-1))\sum(X_i-\bar X_n)^2$, which differs from $\hat\sigma^2$ by the factor $n/(n-1)$. Both are consistent. The MLE is biased downward: $\mathbb{E}[\hat\sigma^2] = \frac{n-1}{n}\sigma^2$, and the bias vanishes as $n\to\infty$. This is the simplest example of a finite-sample bias of an MLE that disappears asymptotically.
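A short Monte Carlo illustration of that bias, with made-up values of $\mu$, $\sigma^2$, and $n$: averaging the MLE over many replications lands near $(n-1)\sigma^2/n$, while the average of $S^2$ lands near $\sigma^2$.

```python
# Finite-sample bias of the variance MLE; all values are illustrative.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.0, 4.0, 5, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_mle = samples.var(axis=1, ddof=0)  # divide by n
s2 = samples.var(axis=1, ddof=1)          # divide by n-1

print(sigma2_mle.mean(), (n - 1) / n * sigma2)  # both close to 3.2
print(s2.mean(), sigma2)                        # both close to 4.0
```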

Failure Mode

The MLE requires $n\ge 2$ to estimate $\sigma^2$ meaningfully; for $n=1$ the estimator $\hat\sigma^2 = 0$ is degenerate. The formula assumes the sample is i.i.d.; correlated Normal samples require the joint Normal log-likelihood and a covariance estimator.

Fisher Information

The Fisher information matrix for $(\mu,\sigma^2)$ in the Normal model, per observation, is

$$I(\mu,\sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$

The off-diagonal entry is zero, so the MLEs $\hat\mu$ and $\hat\sigma^2$ are asymptotically uncorrelated; in finite samples they are actually independent under Normality, which is a stronger statement (see the sample-mean-and-variance-independence theorem below). The Cramer-Rao lower bound on the variance of any unbiased estimator of $\mu$ is therefore $\sigma^2/n$, achieved by $\bar X_n$. See Fisher information for the general computation.
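The matrix can be checked by simulation, since the Fisher information equals the variance of the score; the sketch below uses the analytic score formulas for the Normal and illustrative parameter values.

```python
# Monte Carlo check of the per-observation Fisher information via the score variance.
# mu, sigma2, and the sample size are illustrative.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 1.0, 2.5
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

score_mu = (x - mu) / sigma2
score_sigma2 = -1.0 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2 ** 2)

print(np.var(score_mu), 1 / sigma2)                  # both ~0.4
print(np.var(score_sigma2), 1 / (2 * sigma2 ** 2))   # both ~0.08
print(np.cov(score_mu, score_sigma2)[0, 1])          # ~0: the off-diagonal entry
```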

Sample Mean and Sample Variance Are Independent

Theorem

Independence of Sample Mean and Sample Variance

Statement

Let $X_1,\dots,X_n$ be i.i.d. $\mathcal{N}(\mu,\sigma^2)$, $\bar X_n = (1/n)\sum X_i$, and $S^2 = (1/(n-1))\sum (X_i-\bar X_n)^2$. Then

  1. $\bar X_n \sim \mathcal{N}(\mu, \sigma^2/n)$.
  2. $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$.
  3. $\bar X_n$ and $S^2$ are independent.

Intuition

Decompose the sample vector $(X_1,\dots,X_n)$ into its projection onto the all-ones direction (which carries $\bar X_n$) and the orthogonal complement (which carries the deviations $X_i-\bar X_n$). For i.i.d. Normal data the two projections are independent Normals, and the squared norm of an orthogonal Normal projection is a Chi-squared.

Proof Sketch

Write $\mathbf X = (X_1,\dots,X_n)^\top$. Let $U$ be an orthogonal matrix with first row $(1/\sqrt n,\dots,1/\sqrt n)$. Set $\mathbf Y = U(\mathbf X - \mu\mathbf 1)/\sigma$; then $\mathbf Y$ is a standard Normal vector in $\mathbb{R}^n$ because orthogonal transformations preserve the standard Normal distribution. The first coordinate $Y_1 = \sqrt n(\bar X_n-\mu)/\sigma$ and the remaining $Y_2,\dots,Y_n$ are independent. The sum of squared deviations equals $\sigma^2\sum_{i=2}^n Y_i^2$, hence $(n-1)S^2/\sigma^2 = \sum_{i=2}^n Y_i^2 \sim \chi^2_{n-1}$, and this is independent of $Y_1$, hence of $\bar X_n$. The full argument is Cochran's theorem.
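All three claims show up cleanly in simulation; the sketch below uses arbitrary parameter values, checks the two marginal laws with KS tests, and checks the independence claim through the sample correlation of $\bar X_n$ and $S^2$ (zero correlation is necessary rather than sufficient, but it is the quick diagnostic).

```python
# Simulation of the three claims for illustrative mu, sigma, n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma, n, reps = 2.0, 1.5, 8, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print(stats.kstest(xbar, "norm", args=(mu, sigma / np.sqrt(n))).pvalue)        # claim 1
print(stats.kstest((n - 1) * s2 / sigma ** 2, "chi2", args=(n - 1,)).pvalue)   # claim 2
print(np.corrcoef(xbar, s2)[0, 1])                                             # claim 3: ~0
```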

Why It Matters

This three-part statement is the engine behind almost every classical inference for Normal samples. It identifies the law of the sample mean, the law of the sample variance (as a scaled Chi-squared), and their independence, which is what makes the t-statistic $(\bar X_n - \mu)/(S/\sqrt n)$ a Student-t random variable rather than an arbitrary ratio of dependent random variables. See Student-t distribution and t-test for the consequence.

Failure Mode

The independence of sample mean and sample variance is a special property of the Normal distribution. For non-Normal i.i.d. samples the sample mean and sample variance are not independent in finite samples; they only become asymptotically uncorrelated. Using the Student-t exact distribution outside of Normal samples replaces an exact statement with an approximation whose accuracy depends on tail weight and sample size.

Where the Normal Appears Downstream

The Normal feeds the classical sampling distributions. The connections derived in the distributions atlas instantiate here (a simulation sketch follows the list):

  • $Z^2\sim\chi^2_1$ and $\sum_{i=1}^k Z_i^2\sim\chi^2_k$ for independent standard Normals. See chi-squared distribution and tests.
  • $Z/\sqrt{V/k}\sim t_k$ for $Z\sim\mathcal{N}(0,1)$ and $V\sim\chi^2_k$ independent. See Student-t distribution and t-test.
  • The MLE $\hat\theta$ of any regular parametric model satisfies, asymptotically, $\sqrt n(\hat\theta - \theta_0)\to\mathcal{N}(0, I(\theta_0)^{-1})$. See maximum likelihood estimation.
  • The central limit theorem says that the standardized sample mean of any i.i.d. sample with finite variance converges in distribution to a Normal. See central limit theorem.
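The first two bridges can be simulated directly; the sketch below draws the ingredients with NumPy and tests the claimed laws with SciPy. Sample sizes and degrees of freedom are arbitrary illustrations.

```python
# Simulations of two bridges: Z^2 ~ chi2_1 and Z / sqrt(V/k) ~ t_k.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
m, k = 200_000, 7  # replications and degrees of freedom, chosen for illustration

z = rng.standard_normal(m)
print(stats.kstest(z ** 2, "chi2", args=(1,)).pvalue)   # Z^2 against chi-squared(1)

v = rng.chisquare(k, size=m)                            # independent of the fresh Z below
t = rng.standard_normal(m) / np.sqrt(v / k)
print(stats.kstest(t, "t", args=(k,)).pvalue)           # against Student-t with k d.o.f.
```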

The Normal is also the conjugate prior for the mean of a Normal likelihood with known variance; for unknown variance, the conjugate prior on the variance is an Inverse-Gamma (equivalently, a Gamma prior on the precision). See bayesian estimation for the posterior update.
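A minimal sketch of the known-variance conjugate update, with hypothetical prior parameters and data: the posterior precision is the sum of the prior precision and $n/\sigma^2$, and the posterior mean is the precision-weighted average of the prior mean and the sample total.

```python
# Conjugate Normal-Normal update for the mean with known variance.
# The prior (m0, tau0_sq), the known sigma2, and the data are hypothetical.
import numpy as np

sigma2 = 4.0                          # known observation variance
m0, tau0_sq = 0.0, 10.0               # prior mean and variance for mu
x = np.array([2.1, 3.3, 1.8, 2.7])    # hypothetical observations
n = x.size

post_precision = 1 / tau0_sq + n / sigma2
tau1_sq = 1 / post_precision
m1 = tau1_sq * (m0 / tau0_sq + x.sum() / sigma2)
print(m1, tau1_sq)                    # posterior mean and variance for mu
```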

Common Confusions

Watch Out

The MLE for variance divides by n, not n minus one

The MLE of $\sigma^2$ for an i.i.d. Normal sample is $\hat\sigma^2 = (1/n)\sum(X_i-\bar X_n)^2$. The unbiased estimator $S^2$ divides by $n-1$. They are different estimators with different finite-sample properties: the MLE is biased but has smaller mean squared error; $S^2$ is unbiased. Mislabeling one as the other shifts every reported standard error by a factor of $\sqrt{n/(n-1)}$.

Watch Out

The Normal MGF is finite everywhere, but that does not make the Normal lighter-tailed than every other distribution

Sub-Gaussian distributions have MGFs finite on all of $\mathbb{R}$. There are non-Normal sub-Gaussian distributions, for example any bounded random variable. Conversely, having all moments finite does not guarantee a finite MGF; the Lognormal distribution has every polynomial moment finite but no MGF on any neighborhood of zero, because $\mathbb{E}[e^{sX}] = \infty$ for every $s>0$.

Watch Out

Closure under sums needs independence, not just zero correlation

For jointly Normal $(X,Y)$, zero correlation implies independence, so closure of independent sums extends to uncorrelated sums in that joint setting. For non-Normal $(X,Y)$, zero correlation does not imply independence and the sum-closure result can fail. Always verify joint Normality before invoking sum closure from a correlation calculation.

Exercises

ExerciseCore

Problem

Let $X\sim\mathcal{N}(3, 4)$. Find $\mathbb{P}(1 \le X \le 5)$ in terms of $\Phi$ and evaluate numerically.

ExerciseCore

Problem

Let $X_1,X_2$ be independent $\mathcal{N}(0,1)$ random variables. Find the distribution of $Y = 3X_1 - 4X_2$.

ExerciseCore

Problem

Show that the MLE $\hat\sigma^2 = (1/n)\sum(X_i-\bar X_n)^2$ has expectation $(n-1)\sigma^2/n$ and is therefore biased downward.

ExerciseAdvanced

Problem

Let $X\sim\mathcal{N}(\mu,\sigma^2)$. Show that for every $\epsilon > 0$, $$\mathbb{P}(X - \mu \ge \epsilon) \le \exp\!\left(-\frac{\epsilon^2}{2\sigma^2}\right).$$

ExerciseAdvanced

Problem

Let $X_1,\dots,X_n$ be i.i.d. $\mathcal{N}(\mu,\sigma^2)$ with $\sigma^2$ known. Find the Fisher information for $\mu$ from a single observation and verify that the Cramer-Rao lower bound $\sigma^2/n$ is achieved by $\bar X_n$.

References

Canonical:

  • Casella and Berger, Statistical Inference (2002), Chapter 3 (Section 3.3 on the Normal family), Chapter 5 (Section 5.3 on sampling distributions for the Normal), and Chapter 7 (Section 7.2 on Normal MLE).
  • Lehmann and Casella, Theory of Point Estimation (1998), Chapter 1 (sufficiency for the Normal), Chapter 2 (UMVUE for $\mu$ and $\sigma^2$).
  • Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 1 (Sections 1.2 and 1.3).

Probability:

  • Blitzstein and Hwang, Introduction to Probability (2019), Chapter 5.
  • Durrett, Probability: Theory and Examples (2019), Chapter 3 (Section 3.4 on characteristic functions of the Normal).
  • Vershynin, High-Dimensional Probability (2018), Chapter 2 (sub-Gaussian properties of the Normal).

Foundational papers:

  • Gauss, Theoria Motus Corporum Coelestium (1809), the historical introduction of the density as the error law of least-squares regression.

Last reviewed: May 11, 2026
