
Foundations

Common Probability Distributions

The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.


Why This Matters

Every model you will ever train makes distributional assumptions, whether explicitly or implicitly. Logistic regression assumes Bernoulli responses. Linear regression with squared loss assumes Gaussian noise. Bayesian methods require choosing priors: Beta, Gamma, Dirichlet. If you cannot write down the PDF, compute the mean and variance, and explain when a distribution arises, you will be guessing instead of reasoning.

This page is a reference you will return to repeatedly. Know these cold.


Mental Model

Distributions fall into two families by support:

  • Discrete (PMF): Bernoulli, Binomial, Poisson, Geometric, Multinomial
  • Continuous (PDF): Uniform, Gaussian, Exponential, Gamma, Beta, Chi-squared, Student-t, F, Cauchy, Dirichlet

Many are related: the Binomial is a sum of Bernoullis, the Exponential is a special Gamma, the Chi-squared is a special Gamma, the Beta is conjugate to the Bernoulli, and the Dirichlet generalizes the Beta.

Discrete Distributions

Definition

Bernoulli Distribution

A single binary trial with success probability $p \in [0, 1]$.

PMF: $P(X = k) = p^k (1-p)^{1-k}$ for $k \in \{0, 1\}$

Mean: $\mathbb{E}[X] = p$

Variance: $\text{Var}(X) = p(1 - p)$

When it arises: Binary classification labels, coin flips, any yes/no outcome. The log-likelihood of Bernoulli data gives rise to the cross-entropy loss used in logistic regression.
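The Bernoulli-to-cross-entropy connection can be verified in a few lines. A minimal sketch (the helper name `bernoulli_nll` is ours, not a standard API): the average negative Bernoulli log-likelihood of labels under predicted probabilities is exactly the binary cross-entropy loss.

```python
import numpy as np

def bernoulli_nll(y, p):
    """Average negative log-likelihood of binary labels y under Bernoulli(p).

    This is exactly the binary cross-entropy loss used in logistic regression.
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])          # observed labels
p = np.array([0.9, 0.2, 0.8, 0.6])  # predicted success probabilities
loss = bernoulli_nll(y, p)
```

Production code would clip `p` away from 0 and 1 to avoid `log(0)`; this sketch omits that for clarity.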

Definition

Binomial Distribution

The number of successes in $n$ independent Bernoulli$(p)$ trials.

PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \ldots, n$

Mean: $\mathbb{E}[X] = np$

Variance: $\text{Var}(X) = np(1 - p)$

Key fact: If $X_1, \ldots, X_n \sim \text{Bern}(p)$ independently, then $\sum_i X_i \sim \text{Bin}(n, p)$. As $n \to \infty$ with $np \to \lambda$, the Binomial converges to $\text{Poisson}(\lambda)$.
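The Poisson limit is easy to check with SciPy: hold $np = \lambda$ fixed while $n$ grows and compare PMFs. A quick sketch (the specific choices $\lambda = 3$ and $n = 10{,}000$ are arbitrary):

```python
import numpy as np
from scipy import stats

lam, n = 3.0, 10_000
k = np.arange(0, 20)

# Binomial(n, lam/n) should be close to Poisson(lam) for large n.
binom_pmf = stats.binom.pmf(k, n, lam / n)
pois_pmf = stats.poisson.pmf(k, lam)

max_gap = np.max(np.abs(binom_pmf - pois_pmf))  # worst pointwise disagreement
```

The gap shrinks roughly like $\lambda^2 / n$, so doubling $n$ roughly halves it.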

Definition

Poisson Distribution

Models the count of events in a fixed interval when events occur independently at a constant rate $\lambda > 0$.

PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \ldots$

Mean: $\mathbb{E}[X] = \lambda$

Variance: $\text{Var}(X) = \lambda$

When it arises: Count data: number of clicks, arrivals, mutations. The mean equals the variance; if your count data has variance much larger than the mean, the Poisson model is wrong (use the negative binomial instead).
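The mean-equals-variance property, and the overdispersed alternative, can both be read off exact moments from SciPy (the parameter values here are arbitrary illustrations):

```python
from scipy import stats

# Poisson: mean and variance are both lambda.
lam = 4.0
p_mean, p_var = stats.poisson.stats(lam, moments="mv")

# Negative binomial: variance strictly exceeds the mean (overdispersion).
# SciPy's nbinom(n, p) has mean n(1-p)/p and variance n(1-p)/p^2.
n, p = 5, 0.5
nb_mean, nb_var = stats.nbinom.stats(n, p, moments="mv")
```

If your empirical variance-to-mean ratio is well above 1, the negative binomial's extra parameter absorbs the overdispersion that the Poisson cannot.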

Definition

Geometric Distribution

The number of trials until the first success in independent Bernoulli$(p)$ trials. (Convention: $X$ includes the successful trial.)

PMF: $P(X = k) = (1-p)^{k-1} p$ for $k = 1, 2, 3, \ldots$

Mean: $\mathbb{E}[X] = 1/p$

Variance: $\text{Var}(X) = (1-p)/p^2$

Key property: The Geometric distribution is the only discrete distribution with the memoryless property: $P(X > s + t \mid X > s) = P(X > t)$.
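The memoryless property can be checked numerically with SciPy's survival function (the choices $p = 0.3$, $s = 4$, $t = 6$ are arbitrary):

```python
from scipy import stats

p = 0.3
geom = stats.geom(p)  # SciPy convention: support k = 1, 2, ... (includes the success)

def tail(k):
    # Survival function: P(X > k) = (1 - p)^k
    return geom.sf(k)

s, t = 4, 6
lhs = tail(s + t) / tail(s)  # P(X > s + t | X > s)
rhs = tail(t)                # P(X > t)
```

Both sides equal $(1-p)^t$, which is the algebraic content of memorylessness.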

Definition

Multinomial Distribution

The multivariate generalization of the Binomial. In $n$ trials, each of $K$ categories occurs with probability $p_k$, and $X_k$ counts category $k$.

PMF: $P(X_1 = x_1, \ldots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} p_1^{x_1} \cdots p_K^{x_K}$

where $\sum_k x_k = n$ and $\sum_k p_k = 1$.

Marginals: Each $X_k \sim \text{Bin}(n, p_k)$.

When it arises: Multi-class classification, bag-of-words models, topic models. The softmax output of a neural network parameterizes a single-trial Multinomial (i.e., a Categorical distribution).
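The softmax-to-Categorical link can be sketched directly: softmax turns logits into a probability vector, and a single multinomial draw from it is a Categorical (one-hot) sample. The logit values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - np.max(z)        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])  # e.g. a network's output layer
probs = softmax(logits)              # parameterizes a Categorical distribution

# One trial of Multinomial(1, probs) = one Categorical sample, as a one-hot vector.
one_hot = rng.multinomial(1, probs)
```

The max-subtraction trick changes nothing mathematically (softmax is shift-invariant) but prevents overflow for large logits.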

Continuous Distributions

Definition

Uniform Distribution

Constant density on the interval $[a, b]$.

PDF: $f(x) = \frac{1}{b - a}$ for $x \in [a, b]$, zero otherwise.

Mean: $\mathbb{E}[X] = \frac{a + b}{2}$

Variance: $\text{Var}(X) = \frac{(b - a)^2}{12}$

When it arises: Maximum entropy distribution on a bounded interval with no other constraints. Used in random initialization, random search, and as the base distribution for inverse transform sampling.
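Inverse transform sampling is worth seeing once in code: if $U \sim \text{Unif}(0,1)$ and $F$ is a CDF, then $F^{-1}(U)$ has CDF $F$. A sketch using the Exponential, whose inverse CDF has the closed form $F^{-1}(u) = -\ln(1-u)/\lambda$ (rate and sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)

lam = 2.0
u = rng.uniform(0.0, 1.0, size=200_000)   # base Uniform(0, 1) draws

# Inverse CDF of Exp(lam); log1p(-u) computes log(1 - u) accurately near u = 0.
x = -np.log1p(-u) / lam

sample_mean = x.mean()  # should be close to 1/lam = 0.5
```

The same recipe works for any distribution whose inverse CDF you can evaluate, which is why the Uniform is the base distribution for so many samplers.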

Definition

Gaussian (Normal) Distribution

The most important continuous distribution. Parameterized by mean $\mu$ and variance $\sigma^2 > 0$.

PDF: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

Mean: $\mathbb{E}[X] = \mu$

Variance: $\text{Var}(X) = \sigma^2$

MGF: $\mathbb{E}[e^{tX}] = \exp(\mu t + \sigma^2 t^2 / 2)$

The normalizing constant $1/\sqrt{2\pi\sigma^2}$ ensures $\int_{-\infty}^{\infty} f(x)\,dx = 1$. This requires the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$.

Definition

Multivariate Gaussian

For $X \in \mathbb{R}^d$ with mean vector $\mu \in \mathbb{R}^d$ and positive definite covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:

PDF: $f(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$

Key properties:

  • Marginals are Gaussian: if $X \sim \mathcal{N}(\mu, \Sigma)$, any subset of coordinates is also multivariate Gaussian.
  • For jointly Gaussian distributions, uncorrelated implies independent. Marginal Gaussianity alone does not suffice. A standard counterexample: $X \sim \mathcal{N}(0, 1)$ and $Y = X \cdot \mathbb{1}\{|X| \leq c\} - X \cdot \mathbb{1}\{|X| > c\}$. Both marginals are Gaussian and the pair is uncorrelated for an appropriate $c$, yet $X$ and $Y$ are dependent (since $|X| = |Y|$).
  • Affine transformations preserve Gaussianity: $AX + b \sim \mathcal{N}(A\mu + b, A\Sigma A^\top)$.

When it arises: Central limit theorem, Bayesian linear regression posterior, Gaussian processes, VAE latent spaces.
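The affine-transformation property is easy to sanity-check by sampling: transform draws from $\mathcal{N}(\mu, \Sigma)$ and compare empirical moments against $A\mu + b$ and $A\Sigma A^\top$. The matrices below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Y = X @ A.T + b  # row-wise affine transform of each sample

# Theory: Y ~ N(A mu + b, A Sigma A^T)
emp_mean = Y.mean(axis=0)
emp_cov = np.cov(Y, rowvar=False)
theo_mean = A @ mu + b
theo_cov = A @ Sigma @ A.T
```

This property is the workhorse behind the reparameterization trick and Kalman-filter updates: any linear map of a Gaussian stays Gaussian, with moments you can write down in closed form.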

Definition

Exponential Distribution

Models waiting times between events in a Poisson process with rate $\lambda > 0$.

PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$

Mean: $\mathbb{E}[X] = 1/\lambda$

Variance: $\text{Var}(X) = 1/\lambda^2$

Key property: Memoryless: $P(X > s + t \mid X > s) = P(X > t)$. This is the only continuous distribution with this property.

Relationship: $\text{Exp}(\lambda) = \text{Gamma}(1, \lambda)$.

Definition

Gamma Distribution

Generalizes the Exponential. Shape $\alpha > 0$, rate $\beta > 0$.

PDF: $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ for $x > 0$

where $\Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\,dt$ is the Gamma function.

Mean: $\mathbb{E}[X] = \alpha/\beta$

Variance: $\text{Var}(X) = \alpha/\beta^2$

Special cases:

  • $\text{Gamma}(1, \lambda) = \text{Exp}(\lambda)$
  • $\text{Gamma}(n/2, 1/2) = \chi^2(n)$
  • Sum of $n$ independent $\text{Exp}(\beta)$ variables is $\text{Gamma}(n, \beta)$

When it arises: Conjugate prior for the Poisson rate and the precision (inverse variance) of a Gaussian. Models positive quantities with skew.
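The sum-of-Exponentials fact can be checked by simulation: sum independent $\text{Exp}(\beta)$ draws and test the result against $\text{Gamma}(n, \beta)$. Note the rate-versus-scale pitfall here: NumPy and SciPy both take a scale, so rate $\beta$ becomes `scale=1/beta`. Parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n, beta = 5, 2.0  # sum of 5 Exp(rate=2) draws should be Gamma(shape=5, rate=2)
expo = rng.exponential(scale=1 / beta, size=(200_000, n))
sums = expo.sum(axis=1)

# Moment checks: mean n/beta = 2.5, variance n/beta^2 = 1.25
mean_err = abs(sums.mean() - n / beta)
var_err = abs(sums.var() - n / beta**2)

# Distributional check: one-sample KS test against Gamma(a=n, loc=0, scale=1/beta)
ks = stats.kstest(sums, "gamma", args=(n, 0, 1 / beta))
```

A tiny KS statistic (well below typical rejection thresholds for this sample size) indicates the simulated sums are indistinguishable from the claimed Gamma.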

Definition

Beta Distribution

Defined on $[0, 1]$ with shape parameters $\alpha, \beta > 0$.

PDF: $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ for $x \in [0, 1]$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$.

Mean: $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$

Variance: $\text{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$

Key fact: The Beta is the conjugate prior for the Bernoulli/Binomial parameter $p$. If $p \sim \text{Beta}(\alpha, \beta)$ and $X \mid p \sim \text{Bin}(n, p)$ with $k$ successes, then $p \mid X \sim \text{Beta}(\alpha + k, \beta + n - k)$.

Special cases: $\text{Beta}(1, 1) = \text{Unif}(0, 1)$.

Definition

Chi-Squared Distribution

The sum of $k$ independent squared standard normals. Degrees of freedom $k$.

PDF: $f(x) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2}\Gamma(k/2)}$ for $x > 0$

Mean: $\mathbb{E}[X] = k$

Variance: $\text{Var}(X) = 2k$

Relationship: $\chi^2(k) = \text{Gamma}(k/2, 1/2)$.

If $Z_1, \ldots, Z_k \sim \mathcal{N}(0,1)$ independently, then $\sum_i Z_i^2 \sim \chi^2(k)$.

When it arises: Hypothesis testing (goodness-of-fit, likelihood ratio tests), analysis of variance. The sample variance of Gaussian data is proportional to a $\chi^2$. Important in concentration: the $\chi^2$ is sub-exponential but not sub-Gaussian.

Definition

Student-t Distribution

Arises when estimating the mean of a Gaussian population with unknown variance. Degrees of freedom $\nu > 0$.

PDF: $f(x) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$

Mean: $\mathbb{E}[X] = 0$ for $\nu > 1$ (undefined for $\nu \leq 1$)

Variance: $\text{Var}(X) = \frac{\nu}{\nu - 2}$ for $\nu > 2$

Construction: If $Z \sim \mathcal{N}(0,1)$ and $V \sim \chi^2(\nu)$ are independent, then $T = Z / \sqrt{V/\nu} \sim t(\nu)$.

Key fact: Heavier tails than the Gaussian. As $\nu \to \infty$, $t(\nu) \to \mathcal{N}(0, 1)$. For small $\nu$, the heavy tails accommodate outliers. This is why the Student-t is used in robust statistics.
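The ratio construction can be verified by simulation: build $T = Z/\sqrt{V/\nu}$ from independent normal and chi-squared draws and test it against $t(\nu)$ (the choice $\nu = 5$ is arbitrary, but $\nu > 2$ is needed for the variance check below).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

nu = 5
N = 300_000
Z = rng.standard_normal(N)
V = rng.chisquare(nu, N)
T = Z / np.sqrt(V / nu)  # should be t(nu) distributed

# Distributional check against t(nu), plus the variance nu/(nu-2) = 5/3
ks = stats.kstest(T, "t", args=(nu,))
emp_var = T.var()
```

The same construction is exactly what happens when you studentize a sample mean: the numerator is Gaussian, the denominator is a scaled chi-squared from the sample variance.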

Definition

F Distribution

The ratio of two independent scaled chi-squared variables.

Construction: If $U \sim \chi^2(d_1)$ and $V \sim \chi^2(d_2)$ are independent, then $F = \frac{U/d_1}{V/d_2} \sim F(d_1, d_2)$.

Mean: $\mathbb{E}[X] = \frac{d_2}{d_2 - 2}$ for $d_2 > 2$

When it arises: ANOVA, comparing two model fits (F-test for nested models), testing whether a group of regression coefficients is jointly zero.

Relationship: If $T \sim t(\nu)$, then $T^2 \sim F(1, \nu)$.
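The $T^2 \sim F(1, \nu)$ identity can be confirmed exactly through CDFs: $P(T^2 \leq x) = P(-\sqrt{x} \leq T \leq \sqrt{x})$, which must equal the $F(1, \nu)$ CDF at $x$. The values $\nu = 8$ and $x = 2.5$ are arbitrary.

```python
import numpy as np
from scipy import stats

nu = 8
x = 2.5

lhs = stats.f.cdf(x, 1, nu)  # P(F(1, nu) <= x)
rhs = stats.t.cdf(np.sqrt(x), nu) - stats.t.cdf(-np.sqrt(x), nu)  # P(T^2 <= x)
```

This is why a two-sided t-test and the corresponding one-coefficient F-test always agree: they are literally the same test.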

Definition

Cauchy Distribution

Location $x_0$, scale $\gamma > 0$.

PDF: $f(x) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}$

Mean: Does not exist (the integral diverges).

Variance: Does not exist.

Key fact: The Cauchy is $t(1)$, the Student-t with one degree of freedom. It is the standard example of a distribution with no mean: the sample mean of $n$ i.i.d. Cauchy variables has the same distribution as a single observation, so the law of large numbers does not apply. This is why finite moments matter for concentration inequalities.
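The failure of the law of large numbers shows up vividly in a running-mean plot; in code, one can compare how much the running mean of Cauchy draws still wanders over the second half of a long run versus Gaussian draws (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
cauchy = rng.standard_cauchy(N)
gauss = rng.standard_normal(N)

idx = np.arange(1, N + 1)
cauchy_means = np.cumsum(cauchy) / idx  # running sample means
gauss_means = np.cumsum(gauss) / idx

# Spread (max - min) of the running mean over the last half of the run:
cauchy_spread = np.ptp(cauchy_means[N // 2:])
gauss_spread = np.ptp(gauss_means[N // 2:])
```

The Gaussian running mean settles into an ever-narrower band; the Cauchy running mean keeps taking large jumps no matter how many samples arrive, because each new batch can contain an observation comparable in size to the entire running sum.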

Definition

Dirichlet Distribution

The multivariate generalization of the Beta, defined on the $(K-1)$-simplex $\{x \in \mathbb{R}^K : x_k \geq 0, \sum_k x_k = 1\}$.

PDF: $f(x_1, \ldots, x_K) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K x_k^{\alpha_k - 1}$

Marginals: Each $X_k \sim \text{Beta}(\alpha_k, \sum_{j \neq k} \alpha_j)$.

Mean: $\mathbb{E}[X_k] = \alpha_k / \sum_j \alpha_j$

Key fact: Conjugate prior for the Multinomial/Categorical. If $\theta \sim \text{Dir}(\alpha)$ and $X \mid \theta \sim \text{Mult}(n, \theta)$ with counts $c_1, \ldots, c_K$, then $\theta \mid X \sim \text{Dir}(\alpha_1 + c_1, \ldots, \alpha_K + c_K)$.

When it arises: Latent Dirichlet Allocation (LDA), Bayesian multi-class models, any model requiring a prior over probability vectors.
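The conjugate update is a one-liner, which is much of the Dirichlet's practical appeal. A minimal sketch (the helper name `dirichlet_posterior` and the counts are ours, for illustration):

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dir(alpha) prior + multinomial counts -> Dir(alpha + counts)."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

alpha = np.array([1.0, 1.0, 1.0])  # uniform prior over the 2-simplex
counts = np.array([5, 3, 2])       # observed category counts (n = 10 trials)

post = dirichlet_posterior(alpha, counts)  # Dir(6, 4, 3)
post_mean = post / post.sum()              # posterior mean probability vector
```

The posterior mean $(\alpha_k + c_k)/\sum_j(\alpha_j + c_j)$ is the familiar add-$\alpha$ smoothed frequency estimate; with $\alpha_k = 1$ this is Laplace smoothing.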

Relationships Between Distributions

The major distributions form a web of connections:

  • Bernoulli → Binomial: sum of $n$ i.i.d. Bernoullis
  • Binomial → Poisson: limit as $n \to \infty$, $p \to 0$, $np \to \lambda$
  • Exponential → Gamma: $\text{Gamma}(1, \lambda) = \text{Exp}(\lambda)$; sum of $n$ Exponentials is $\text{Gamma}(n, \lambda)$
  • Gamma → Chi-squared: $\chi^2(k) = \text{Gamma}(k/2, 1/2)$
  • Gaussian → Chi-squared: sum of squared standard normals
  • Gaussian + Chi-squared → Student-t: ratio construction
  • Chi-squared + Chi-squared → F: ratio of two scaled chi-squareds
  • Beta ↔ Bernoulli: conjugate prior relationship
  • Beta → Dirichlet: multivariate generalization
  • Dirichlet ↔ Multinomial: conjugate prior relationship
  • Gaussian + Gaussian prior → Gaussian posterior: self-conjugacy

The distributions of sorted samples from these families (the $k$-th smallest value from $n$ draws) are studied in order statistics, which connects to extreme value theory and nonparametric inference.

Main Theorems

Proposition

The Gaussian Normalizing Constant

Statement

The Gaussian integral evaluates to:

$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$

Consequently, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}$ integrates to 1 and is a valid PDF.

Intuition

The $1/\sqrt{2\pi}$ in the Gaussian PDF is not arbitrary. It is the only constant that makes the total probability equal to 1. Without it, the "bell curve" would not be a proper probability distribution.

Proof Sketch

Let $I = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx$. Then:

$I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy$

Switch to polar coordinates: $x = r\cos\theta$, $y = r\sin\theta$, $dx\,dy = r\,dr\,d\theta$:

$I^2 = \int_0^{2\pi}\int_0^{\infty} r\,e^{-r^2/2}\,dr\,d\theta = 2\pi \cdot \left[-e^{-r^2/2}\right]_0^{\infty} = 2\pi$

Therefore $I = \sqrt{2\pi}$.
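The result can also be confirmed by numerical quadrature, which is a useful habit whenever a normalizing constant is in doubt:

```python
import numpy as np
from scipy.integrate import quad

# Numerically evaluate the Gaussian integral over the whole real line.
val, abserr = quad(lambda x: np.exp(-x**2 / 2), -np.inf, np.inf)
target = np.sqrt(2 * np.pi)  # the analytic value, ~2.5066
```

`quad` handles the infinite limits directly via a change of variables, and the returned error estimate `abserr` tells you how much to trust the result.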

Why It Matters

This is one of the most important calculations in all of probability. The polar coordinates trick is a canonical proof technique. Understanding where $\sqrt{2\pi}$ comes from demystifies the Gaussian and explains why it appears in entropy formulas, the CLT, and information-theoretic quantities.

Failure Mode

Students sometimes assume the Gaussian integrates to 1 "by definition" without understanding the calculation. This leads to confusion when computing marginals of multivariate Gaussians or when normalizing constants matter (e.g., in Bayesian inference and partition functions).

Canonical Examples

Example

Beta-Bernoulli conjugacy in action

Suppose you observe $k = 7$ heads in $n = 10$ coin flips. With a $\text{Beta}(1, 1)$ prior (uniform on $[0, 1]$), the posterior is $\text{Beta}(1 + 7, 1 + 3) = \text{Beta}(8, 4)$.

The posterior mean is $8/12 \approx 0.667$, slightly shrunk from the MLE of $7/10 = 0.7$ toward $0.5$. The posterior mode is $(8-1)/(8+4-2) = 0.7$, which equals the MLE. With a stronger prior $\text{Beta}(10, 10)$, the posterior would be $\text{Beta}(17, 13)$ with mean $17/30 \approx 0.567$, with more shrinkage toward the prior mean of $0.5$.
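This example carries over directly to SciPy, which also gives quantities the closed forms above do not, such as credible intervals:

```python
from scipy import stats

alpha0, beta0 = 1, 1  # Beta(1, 1): uniform prior on p
k, n = 7, 10          # 7 heads in 10 flips

post = stats.beta(alpha0 + k, beta0 + n - k)  # posterior Beta(8, 4)

post_mean = post.mean()                                          # 8/12
post_mode = (alpha0 + k - 1) / (alpha0 + beta0 + n - 2)          # 7/10, the MLE
lo, hi = post.interval(0.95)  # central 95% posterior credible interval
```

Swapping in the stronger `stats.beta(10 + 7, 10 + 3)` prior-plus-data posterior reproduces the $17/30$ mean quoted above.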

Example

Chi-squared as sum of squared normals

If $Z_1, Z_2, Z_3 \sim \mathcal{N}(0, 1)$ independently, then $W = Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2(3)$ with $\mathbb{E}[W] = 3$ and $\text{Var}(W) = 6$.

Note: each $Z_i^2$ has mean 1 and variance 2. The $\chi^2$ is sub-exponential but not sub-Gaussian because $Z_i^2$ is a product of two sub-Gaussian variables (and products of sub-Gaussians are sub-exponential).
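The moments in this example can be checked by simulation in a few lines (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

N = 500_000
Z = rng.standard_normal((N, 3))  # three independent standard normals per row
W = (Z**2).sum(axis=1)           # W ~ chi^2(3)

emp_mean = W.mean()  # theory: k = 3
emp_var = W.var()    # theory: 2k = 6
```

Note the positive skew of `W`: unlike a Gaussian, the chi-squared piles up near 0 and has a long right tail, which is the visual signature of its sub-exponential (not sub-Gaussian) behavior.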

Common Confusions

Watch Out

The normalizing constant matters

Students sometimes write the Gaussian PDF without $1/\sqrt{2\pi\sigma^2}$ or confuse it with $1/\sqrt{2\pi}$ (forgetting the $\sigma$). The full constant is $1/\sqrt{2\pi\sigma^2} = 1/(\sigma\sqrt{2\pi})$. Getting this wrong means your PDF does not integrate to 1, your log-likelihoods are wrong, and your MLE derivations break. In the multivariate case, the constant involves $|\Sigma|^{1/2}$ and $(2\pi)^{d/2}$.

Watch Out

Gamma parameterization varies between sources

Some sources use the rate parameterization $\text{Gamma}(\alpha, \beta)$ with PDF proportional to $x^{\alpha-1}e^{-\beta x}$, while others use the scale parameterization $\text{Gamma}(\alpha, \theta)$ with PDF proportional to $x^{\alpha-1}e^{-x/\theta}$ where $\theta = 1/\beta$. Always check which convention a textbook or library uses. NumPy and SciPy use the scale parameterization. This page uses rate.

Watch Out

The Cauchy has no mean, not a mean of zero

The standard Cauchy distribution is symmetric about zero, so you might think its mean is zero. It is not. The mean does not exist because $\int |x| f(x)\,dx = \infty$. The "center" of the Cauchy is the median (and mode), which is $x_0$, but the expectation is undefined.

Summary

  • Bernoulli/Binomial/Multinomial: discrete counts; cross-entropy loss comes from Bernoulli likelihood
  • Poisson: counts with mean = variance; approximates Binomial for rare events
  • Gaussian: the universal limit (CLT); maximizes entropy for given mean and variance
  • Exponential/Gamma: positive continuous; memoryless property for Exponential
  • Beta/Dirichlet: priors on probabilities; conjugate to Bernoulli/Multinomial
  • Chi-squared: sum of squared normals; appears in hypothesis tests and variance estimates
  • Student-t: Gaussian with unknown variance; heavier tails accommodate outliers
  • Cauchy: the pathological case: no moments, LLN fails

Exercises

Exercise (Core)

Problem

If $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ independently and $\bar{X} = \frac{1}{n}\sum_i X_i$, what is the distribution of $\bar{X}$? What about $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$?

Exercise (Core)

Problem

Show that $\text{Beta}(1, 1) = \text{Unif}(0, 1)$ by writing out the Beta PDF with $\alpha = \beta = 1$.

Exercise (Advanced)

Problem

Show that if $X \sim \text{Gamma}(\alpha_1, \beta)$ and $Y \sim \text{Gamma}(\alpha_2, \beta)$ are independent, then $X + Y \sim \text{Gamma}(\alpha_1 + \alpha_2, \beta)$.

References

Canonical:

  • Casella & Berger, Statistical Inference (2002), Chapters 3-4
  • DeGroot & Schervish, Probability and Statistics (2012), Chapters 5-6

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 2
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 2
  • Durrett, Probability: Theory and Examples, 5th ed. (2019), Chapters 2-3
  • Billingsley, Probability and Measure, 3rd ed. (1995), Chapters 20-21

Last reviewed: April 2026