Normal Distribution
The Normal distribution as a parametric family: density, moment generating function, closure under affine transformations and sums, MLE for mean and variance, Fisher information, and the bridge to the Chi-squared, Student-t, and F sampling distributions.
Why This Matters
The Normal distribution is the most common parametric model in statistics, and most of its dominance comes from one fact: linear functions of independent Normal random variables stay Normal. That closure property is what makes the central limit theorem usable and what makes the sample mean of any Normal sample tractable. The classical sampling distributions (Chi-squared, Student-t, and F) are all built by combining independent Normals through squaring, root scaling, and ratios. The Normal is also the maximum-entropy distribution on $\mathbb{R}$ subject to a fixed mean and variance, which is why it appears every time you constrain only a mean and a variance and stop there.
Knowing the Normal well means knowing five facts cold: its density, its MGF, its closure under affine maps and independent sums, its MLE, and the joint independence of the sample mean and the sample variance for an i.i.d. Normal sample. The remainder of this page derives those five and connects each to a downstream test or model.
Definition
Normal Distribution
A random variable $X$ has a Normal distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$, written $X \sim \mathcal{N}(\mu, \sigma^2)$, if its density is

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \qquad x \in \mathbb{R}.$$

The case $\mu = 0$, $\sigma^2 = 1$ is the standard Normal, written $Z \sim \mathcal{N}(0, 1)$. The standard Normal CDF is denoted $\Phi$ and its density $\varphi$.
A Normal random variable is supported on all of $\mathbb{R}$ and has positive density everywhere; no value is impossible, although values more than four or five standard deviations from the mean have very small probability. The density is symmetric about $\mu$, has its maximum there, and falls off at a Gaussian rate, which is faster than any polynomial rate and faster than any exponential rate $e^{-c|x|}$.
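As a quick numerical check on the definition, here is a minimal sketch (assuming NumPy and SciPy are available; the parameter values are illustrative, not from the text) that evaluates the density formula directly and compares it with scipy.stats.norm, which parameterizes by standard deviation rather than variance.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.5, 2.0  # illustrative parameters
x = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 7)

# The density exactly as written in the definition above.
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# scipy.stats.norm takes the standard deviation as its scale argument.
assert np.allclose(pdf, stats.norm.pdf(x, loc=mu, scale=sigma))
print(pdf)  # positive everywhere, tiny but nonzero at five sigmas
```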
Density Normalizes to One
Normal Density Integrates to One
Statement

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = 1.$$
Intuition
After centering and scaling via $z = (x - \mu)/\sigma$, the integrand becomes $e^{-z^2/2}$. The classical trick squares the integral and computes the resulting double integral in polar coordinates.
Proof Sketch
Let $I = \int_{-\infty}^{\infty} e^{-z^2/2}\, dz$. Then
$$I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(u^2 + v^2)/2}\, du\, dv = \int_0^{2\pi}\!\int_0^{\infty} e^{-r^2/2}\, r\, dr\, d\theta = 2\pi.$$
So $I = \sqrt{2\pi}$. Substituting $z = (x - \mu)/\sigma$ in the original integral gives $\sigma\sqrt{2\pi}$, which cancels the prefactor.
Why It Matters
The factor $1/\sqrt{2\pi\sigma^2}$ is not optional. Drop it and the density does not normalize, every downstream probability and expectation is wrong by a constant, and the log-likelihood that drives MLE for the Normal is off by the same constant. The constant matters for likelihood ratios across $\sigma^2$ values but cancels for likelihood ratios at fixed $\sigma^2$.
Failure Mode
The polar-coordinate trick uses the rotational symmetry of $e^{-(u^2 + v^2)/2}$. No analogous trick works for non-Gaussian densities. Substituting $z = (x - \mu)/\sigma$ requires $\sigma > 0$; the degenerate case $\sigma = 0$ is a point mass at $\mu$, not a Normal.
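A numerical companion to the proof sketch, assuming SciPy's quadrature (parameters illustrative): integrate the density over the whole line and check that the standardized integrand gives $\sqrt{2\pi}$.

```python
import numpy as np
from scipy import integrate

mu, sigma = -0.7, 1.3  # arbitrary illustrative parameters

# The normalized density integrates to 1 over the whole real line.
total, _ = integrate.quad(
    lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    / np.sqrt(2 * np.pi * sigma ** 2),
    -np.inf, np.inf,
)

# The unnormalized standard integrand gives sqrt(2*pi), as in the polar trick.
I, _ = integrate.quad(lambda z: np.exp(-z ** 2 / 2), -np.inf, np.inf)

print(total)                  # ~1.0
print(I, np.sqrt(2 * np.pi))  # both ~2.50663
```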
Moments and MGF
Normal Moment Generating Function
Statement
For $X \sim \mathcal{N}(\mu, \sigma^2)$ and every $t \in \mathbb{R}$,

$$M_X(t) = \mathbb{E}\!\left[e^{tX}\right] = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right).$$
Intuition
The log-MGF is a quadratic in $t$, with linear coefficient $\mu$ and quadratic coefficient $\sigma^2/2$. The log-MGF being quadratic is the defining property of the Normal family: any random variable whose log-MGF is quadratic is Normal.
Proof Sketch
Write $X = \mu + \sigma Z$ for $Z \sim \mathcal{N}(0, 1)$. Then $M_X(t) = e^{\mu t} M_Z(\sigma t)$, so it suffices to compute
$$M_Z(s) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{sz - z^2/2}\, dz.$$
Complete the square in the exponent: $sz - z^2/2 = s^2/2 - (z - s)^2/2$. Shift the integration variable. The Gaussian integral evaluates to $\sqrt{2\pi}$, leaving $M_Z(s) = e^{s^2/2}$ and hence $M_X(t) = \exp(\mu t + \tfrac{1}{2}\sigma^2 t^2)$.
Why It Matters
The mean and variance read directly off the MGF: $\mathbb{E}[X] = M_X'(0) = \mu$ and $\operatorname{Var}(X) = M_X''(0) - M_X'(0)^2 = \sigma^2$. The MGF is finite for every real $t$, which is stronger than the polynomial-moment condition; it forces sub-Gaussian tails. See sub-Gaussian random variables for the corresponding tail bound.
Failure Mode
The MGF is the exponential of a quadratic only for the Normal. If you compute the MGF of a sample and find a quadratic log-MGF, you have empirical evidence for Normality, but a finite-sample MGF estimate is noisy. The MGF tool is for identification of distributions from algebraic forms, not for goodness-of-fit testing; use goodness-of-fit tests for that.
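To make the noise warning concrete, here is a minimal Monte Carlo sketch (NumPy assumed, parameters illustrative): the empirical MGF $\frac{1}{n}\sum_i e^{tX_i}$ tracks $\exp(\mu t + \sigma^2 t^2/2)$ well at small $t$ and degrades as $t$ grows, because the average is dominated by the few largest draws.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 1.2, 200_000
x = rng.normal(mu, sigma, size=n)

for t in [0.25, 0.5, 1.0, 2.0]:
    empirical = np.exp(t * x).mean()          # noisy finite-sample MGF
    exact = np.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)
    print(f"t={t:4.2f}  empirical={empirical:10.4f}  exact={exact:10.4f}")
```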
Closure Under Affine Maps and Independent Sums
Affine and Sum Closure
Statement
- Affine. If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y = aX + b$ with $a \neq 0$, then $Y \sim \mathcal{N}(a\mu + b,\; a^2\sigma^2)$.
- Sum of independents. If $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, then $X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2)$.
Intuition
Closure under affine maps is just a change of location and scale on the density and follows from the change-of-variables formula. Closure under independent sums is the MGF argument: by independence the MGFs multiply, and the product of two Normal MGFs is itself a Normal MGF.
Proof Sketch
For the affine claim, compute $M_{aX + b}(t) = e^{bt} M_X(at) = \exp\!\left((a\mu + b)t + \tfrac{1}{2}a^2\sigma^2 t^2\right)$, the MGF of $\mathcal{N}(a\mu + b, a^2\sigma^2)$. MGF uniqueness identifies the law. For the sum claim, independence gives $M_{X_1 + X_2}(t) = M_{X_1}(t)\,M_{X_2}(t) = \exp\!\left((\mu_1 + \mu_2)t + \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2)t^2\right)$, the MGF of $\mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
Why It Matters
Closure is what makes the sample mean of an i.i.d. Normal sample explicit: $\bar{X}_n \sim \mathcal{N}(\mu, \sigma^2/n)$, with no asymptotic approximation needed. Closure also drives the central limit theorem heuristic: averages of independent random variables look approximately Normal even when the summands are not Normal, because the family is closed under the operation that defines averaging.
Failure Mode
Independence in the sum result is essential. In general $\operatorname{Var}(X_1 + X_2) = \sigma_1^2 + \sigma_2^2 + 2\operatorname{Cov}(X_1, X_2)$, not $\sigma_1^2 + \sigma_2^2$. Sums of dependent Normals are still Normal if the joint law is multivariate Normal, but the variance carries the covariance term, not the independent-sum value.
Together, the affine and sum results give: every linear combination of jointly Normal random variables is Normal. This is the defining property of the multivariate Normal distribution.
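Both closure claims are easy to check by simulation. A sketch assuming NumPy and SciPy (parameters illustrative): draw two independent Normal samples, form an affine map and a sum, and compare each against its predicted Normal law with a Kolmogorov-Smirnov distance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(1.0, 2.0, n)   # N(1, 4)
x2 = rng.normal(-3.0, 1.5, n)  # N(-3, 2.25), independent of x1

y = 0.5 * x1 + 2.0             # predicted: N(2.5, 1)
s = x1 + x2                    # predicted: N(-2, 6.25)

# Small KS statistics indicate agreement with the predicted laws.
print(stats.kstest(y, "norm", args=(2.5, 1.0)).statistic)
print(stats.kstest(s, "norm", args=(-2.0, np.sqrt(6.25))).statistic)
```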
Maximum Likelihood Estimation
MLE for Mean and Variance
Statement
Given an i.i.d. sample $X_1, \dots, X_n$ from $\mathcal{N}(\mu, \sigma^2)$, the maximum likelihood estimators are

$$\hat{\mu} = \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
Intuition
The log-likelihood is a concave quadratic in $\mu$ and, after profiling out $\mu$, unimodal in $\sigma^2$, so the score equations have a single explicit solution. The MLE for $\mu$ is the sample mean; the MLE for $\sigma^2$ divides the sum of squared deviations by $n$, not $n - 1$.
Proof Sketch
The log-likelihood is
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2.$$
Differentiating with respect to $\mu$ and setting to zero gives $\sum_i (X_i - \mu) = 0$, so $\hat{\mu} = \bar{X}_n$. Substituting back, the profile log-likelihood in $\sigma^2$ is $-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (X_i - \bar{X}_n)^2$, which is maximized at $\hat{\sigma}^2 = \frac{1}{n}\sum_i (X_i - \bar{X}_n)^2$.
Why It Matters
The unbiased estimator of $\sigma^2$ is $S^2 = \frac{1}{n-1}\sum_i (X_i - \bar{X}_n)^2$, which differs from $\hat{\sigma}^2$ by the factor $n/(n-1)$. Both are consistent. The MLE is biased downward by a factor $(n-1)/n$; that bias vanishes as $n \to \infty$. This is the simplest example of a finite-sample bias of an MLE that disappears asymptotically.
Failure Mode
The MLE requires $n \geq 2$ to estimate $\sigma^2$ meaningfully; for $n = 1$ the estimator $\hat{\sigma}^2 = 0$ is degenerate. The formula assumes the sample is i.i.d.; correlated Normal samples require the joint Normal log-likelihood and a covariance estimator.
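The bias factor $(n-1)/n$ is easy to see by Monte Carlo. A minimal sketch, assuming NumPy, with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 10, 50_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)  # sum of squared deviations

mle = ss / n              # divides by n
unbiased = ss / (n - 1)   # divides by n - 1

print(mle.mean(), (n - 1) / n * sigma2)  # both ~3.6: biased downward
print(unbiased.mean(), sigma2)           # both ~4.0
```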
Fisher Information
The Fisher information matrix for $(\mu, \sigma^2)$ in the Normal model, per observation, is

$$I(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$
The off-diagonal entry is zero, so the MLEs $\hat{\mu}$ and $\hat{\sigma}^2$ are asymptotically uncorrelated; in finite samples they are actually independent under Normality, which is a stronger statement (see the sample-mean-and-variance-independence theorem below). The Cramér-Rao lower bound on the variance of any unbiased estimator of $\mu$ is therefore $\sigma^2/n$, achieved by $\bar{X}_n$. See Fisher information for the general computation.
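A quick simulation check of the bound, assuming NumPy (illustrative parameters): the Monte Carlo variance of $\bar{X}_n$ across replications matches the Cramér-Rao floor $\sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 1.0, 2.0, 25, 200_000

# Variance of the sample mean across many replicated samples.
xbar = rng.normal(mu, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)

print(xbar.var())  # ~0.08
print(sigma2 / n)  # Cramer-Rao bound: 0.08
```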
Sample Mean and Sample Variance Are Independent
Independence of Sample Mean and Sample Variance
Statement
Let $X_1, \dots, X_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$, $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$, and $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$. Then
- $\bar{X}_n \sim \mathcal{N}(\mu, \sigma^2/n)$.
- $(n - 1)S^2/\sigma^2 \sim \chi^2_{n-1}$.
- $\bar{X}_n$ and $S^2$ are independent.
Intuition
Decompose the sample vector into its projection onto the all-ones direction (which carries $\bar{X}_n$) and the orthogonal complement (which carries the deviations $X_i - \bar{X}_n$). For i.i.d. Normal data the two projections are independent Normals, and the squared norm of an orthogonal Normal projection is a Chi-squared.
Proof Sketch
Write $Z_i = (X_i - \mu)/\sigma$, so $Z = (Z_1, \dots, Z_n)^\top$ is a standard Normal vector. Let $Q$ be an orthogonal matrix with first row $(1/\sqrt{n}, \dots, 1/\sqrt{n})$. Set $W = QZ$; then $W$ is a standard Normal vector in $\mathbb{R}^n$ because orthogonal transformations preserve the standard Normal distribution. The first coordinate $W_1 = \sqrt{n}\,\bar{Z}_n$ and the remaining coordinates $W_2, \dots, W_n$ are independent. The sum of squared deviations equals $\|Z\|^2 - n\bar{Z}_n^2 = \sum_{i=2}^n W_i^2$, hence $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$, and this is independent of $W_1$, hence of $\bar{X}_n$. The full argument is Cochran's theorem.
Why It Matters
This three-part statement is the engine behind almost every classical inference for Normal samples. It identifies the law of the sample mean, the law of the sample variance (as a scaled Chi-squared), and their independence, which is what makes the t-statistic a Student-t random variable rather than an arbitrary ratio of dependent random variables. See Student-t distribution and t-test for the consequence.
Failure Mode
The independence of sample mean and sample variance is a special property of the Normal distribution. For non-Normal i.i.d. samples the sample mean and sample variance are not independent in finite samples; they only become asymptotically uncorrelated. Using the Student-t exact distribution outside of Normal samples replaces an exact statement with an approximation whose accuracy depends on tail weight and sample size.
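The contrast in the failure mode can be simulated directly. A sketch assuming NumPy: across replications, the correlation between $\bar{X}_n$ and $S^2$ is near zero for Normal samples but clearly positive for a skewed non-Normal case such as the Exponential.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 10, 100_000

def mean_var_corr(draws):
    xbar = draws.mean(axis=1)
    s2 = draws.var(axis=1, ddof=1)  # unbiased sample variance
    return np.corrcoef(xbar, s2)[0, 1]

print(mean_var_corr(rng.normal(0.0, 1.0, size=(reps, n))))  # ~0
print(mean_var_corr(rng.exponential(1.0, size=(reps, n))))  # clearly positive
```

Zero correlation here is only consistent with independence rather than a proof of it; the exact independence is what the orthogonal-projection argument above establishes.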
Where the Normal Appears Downstream
The Normal feeds the classical sampling distributions. The connections derived in the distributions atlas instantiate here:
- $Z^2 \sim \chi^2_1$ and $Z_1^2 + \dots + Z_k^2 \sim \chi^2_k$ for independent standard Normals. See chi-squared distribution and tests.
- $T = Z/\sqrt{V/k} \sim t_k$ for $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_k$ independent. See Student-t distribution and t-test; both constructions are instantiated in the sketch after this list.
- The MLE $\hat{\theta}_n$ of any regular parametric model has, asymptotically, $\sqrt{n}(\hat{\theta}_n - \theta) \Rightarrow \mathcal{N}(0, I(\theta)^{-1})$. See maximum likelihood estimation.
- The central limit theorem says that the standardized sample mean of any i.i.d. sample with finite variance converges in distribution to a Normal. See central limit theorem.
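A sketch of the first two bullets, assuming NumPy and SciPy: build $\chi^2_k$ and $t_k$ variates from standard Normals and check them against the named distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
k, reps = 5, 100_000

z = rng.normal(size=(reps, k))
chi2 = (z ** 2).sum(axis=1)  # sum of k squared standard Normals

z0 = rng.normal(size=reps)   # independent of chi2 by construction
t = z0 / np.sqrt(chi2 / k)   # the Student-t construction

# Small KS statistics: the constructions match chi2_k and t_k.
print(stats.kstest(chi2, "chi2", args=(k,)).statistic)
print(stats.kstest(t, "t", args=(k,)).statistic)
```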
The Normal is also the conjugate prior for the mean of a Normal likelihood with known variance, and the conjugate prior for the variance is an inverse Gamma (equivalently, a Gamma prior on the precision). See bayesian estimation for the posterior update.
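A minimal sketch of the known-variance conjugate update, assuming NumPy; the prior parameters m0 and tau2 are hypothetical choices, and the update formulas are the standard Normal-Normal ones.

```python
import numpy as np

rng = np.random.default_rng(6)

sigma2 = 4.0          # known likelihood variance
m0, tau2 = 0.0, 10.0  # hypothetical prior: mean ~ N(m0, tau2)
x = rng.normal(2.0, np.sqrt(sigma2), size=50)

n, xbar = len(x), x.mean()
post_prec = 1.0 / tau2 + n / sigma2              # precisions add
post_mean = (m0 / tau2 + n * xbar / sigma2) / post_prec

print(post_mean, 1.0 / post_prec)  # posterior mean shrinks xbar toward m0
```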
Common Confusions
The MLE for variance divides by $n$, not $n - 1$
The MLE of $\sigma^2$ for an i.i.d. Normal sample is $\hat{\sigma}^2 = \frac{1}{n}\sum_i (X_i - \bar{X}_n)^2$. The unbiased estimator $S^2$ divides by $n - 1$. They are different estimators with different finite-sample properties: the MLE is biased and has smaller mean squared error; $S^2$ is unbiased. Reporting one when you computed the other misstates the variance estimate by the factor $n/(n-1)$, or $\sqrt{n/(n-1)}$ on the standard-error scale.
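In NumPy this confusion is one keyword away: np.var uses the MLE convention (ddof=0) by default, and ddof=1 gives the unbiased estimator. A small check:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
n = len(x)

mle = np.var(x)               # divides by n (ddof=0 is the default)
unbiased = np.var(x, ddof=1)  # divides by n - 1

assert np.isclose(unbiased, mle * n / (n - 1))
print(mle, unbiased)          # 5.25, 7.0
```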
The Normal MGF is finite everywhere, but that does not make the Normal lighter-tailed than every other distribution
Sub-Gaussian distributions have MGFs finite on all of $\mathbb{R}$. There are non-Normal sub-Gaussian distributions, for example any bounded random variable. Conversely, having all moments finite is not the same as having a finite MGF: the Lognormal distribution has every polynomial moment finite but no MGF on any neighborhood of zero, because $\mathbb{E}[e^{tX}] = \infty$ for every $t > 0$.
Closure under sums needs independence, not just zero correlation
For jointly Normal $(X_1, X_2)$, zero correlation implies independence, so closure of independent sums extends to uncorrelated sums in that joint setting. For non-Normal $(X_1, X_2)$, zero correlation does not imply independence and the sum-closure result can fail. Always verify joint Normality before invoking sum closure from a correlation calculation.
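A classic counterexample sketch, assuming NumPy: take $X$ standard Normal and an independent random sign $S$; then $X$ and $SX$ are each standard Normal and uncorrelated, but their sum has an atom at zero, so it cannot be Normal.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

x = rng.normal(size=n)
s = rng.choice([-1.0, 1.0], size=n)  # independent random sign
y = s * x                            # marginally N(0, 1), uncorrelated with x

print(np.corrcoef(x, y)[0, 1])  # ~0: uncorrelated
print(np.mean(x + y == 0.0))    # ~0.5: atom at zero, so x + y is not Normal
```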
Exercises
Problem
Let $X \sim \mathcal{N}(\mu, \sigma^2)$. Find $P(|X - \mu| \leq 2\sigma)$ in terms of $\Phi$ and evaluate numerically.
Problem
Let $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ be independent random variables. Find the distribution of $aX_1 + bX_2 + c$.
Problem
Show that the MLE estimator $\hat{\sigma}^2$ has expectation $\frac{n-1}{n}\sigma^2$, and is hence biased downward.
Problem
Let $X \sim \mathcal{N}(0, \sigma^2)$. Show that for every $t > 0$,

$$P(X \geq t) \leq \exp\!\left(-\frac{t^2}{2\sigma^2}\right).$$
Problem
Let $X_1, \dots, X_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Find the Fisher information for $\mu$ from a single observation and verify that the Cramér-Rao lower bound is achieved by $\bar{X}_n$.
References
Canonical:
- Casella and Berger, Statistical Inference (2002), Chapter 3 (Section 3.3 on the Normal family), Chapter 5 (Section 5.3 on sampling distributions for the Normal), and Chapter 7 (Section 7.2 on Normal MLE).
- Lehmann and Casella, Theory of Point Estimation (1998), Chapter 1 (sufficiency for the Normal), Chapter 2 (UMVUE for and ).
- Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 1 (Sections 1.2 and 1.3).
Probability:
- Blitzstein and Hwang, Introduction to Probability (2019), Chapter 5.
- Durrett, Probability: Theory and Examples (2019), Chapter 3 (Section 3.4 on characteristic functions of the Normal).
- Vershynin, High-Dimensional Probability (2018), Chapter 2 (sub-Gaussian properties of the Normal).
Foundational papers:
- Gauss, Theoria Motus Corporum Coelestium (1809), the historical introduction of the density as the error law of least-squares regression.
Last reviewed: May 11, 2026
Canonical graph
Required prerequisites
- Common Probability Distributions
- Distributions Atlas
- Exponential Function Properties
- Integration and Change of Variables
- Moment Generating Functions
Derived topics
- Chi-Squared Distribution and Tests
- Student-t Distribution and t-Test