Foundations
Common Probability Distributions
The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.
Why This Matters
Every model you will ever train makes distributional assumptions, whether explicitly or implicitly. Logistic regression assumes Bernoulli responses. Linear regression with squared loss assumes Gaussian noise. Bayesian methods require choosing priors: Beta, Gamma, Dirichlet. If you cannot write down the PDF, compute the mean and variance, and explain when a distribution arises, you will be guessing instead of reasoning.
This page is a reference you will return to repeatedly. Know these cold.
Mental Model
Distributions fall into two families by support:
- Discrete (PMF): Bernoulli, Binomial, Poisson, Geometric, Multinomial
- Continuous (PDF): Uniform, Gaussian, Exponential, Gamma, Beta, Chi-squared, Student-t, F, Cauchy, Dirichlet
Many are related: the Binomial is a sum of Bernoullis, the Exponential is a special Gamma, the Chi-squared is a special Gamma, the Beta is conjugate to the Bernoulli, and the Dirichlet generalizes the Beta.
Discrete Distributions
Bernoulli Distribution
A single binary trial with success probability $p \in [0, 1]$.
PMF: $P(X = k) = p^k (1-p)^{1-k}$ for $k \in \{0, 1\}$
Mean: $\mathbb{E}[X] = p$
Variance: $\mathrm{Var}(X) = p(1-p)$
When it arises: Binary classification labels, coin flips, any yes/no outcome. The log-likelihood of Bernoulli data gives rise to the cross-entropy loss used in logistic regression.
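The Bernoulli-likelihood-to-cross-entropy connection can be sketched in a few lines (a minimal illustration; the labels and predicted probabilities below are made up):

```python
import numpy as np

def bernoulli_nll(y, p):
    # Average negative Bernoulli log-likelihood of labels y under
    # predicted success probabilities p -- exactly binary cross-entropy.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])            # hypothetical binary labels
p = np.array([0.9, 0.2, 0.7, 0.6])    # hypothetical model outputs
loss = bernoulli_nll(y, p)            # smaller when p agrees with y
```

Minimizing this quantity over the parameters of $p$ is exactly maximum likelihood under the Bernoulli model.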
Binomial Distribution
The number of successes in $n$ independent Bernoulli trials.
PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \dots, n$
Mean: $\mathbb{E}[X] = np$
Variance: $\mathrm{Var}(X) = np(1-p)$
Key fact: If $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$ independently, then $\sum_{i=1}^n X_i \sim \mathrm{Binomial}(n, p)$. As $n \to \infty$ with $np \to \lambda$, the Binomial converges to $\mathrm{Poisson}(\lambda)$.
Poisson Distribution
Models the count of events in a fixed interval when events occur independently at a constant rate $\lambda > 0$.
PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \dots$
Mean: $\mathbb{E}[X] = \lambda$
Variance: $\mathrm{Var}(X) = \lambda$
When it arises: Count data: numbers of clicks, arrivals, mutations. The mean equals the variance; if your count data has variance much larger than the mean, the Poisson model is wrong (use the negative binomial instead).
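The mean-equals-variance property is easy to verify by simulation (a sketch; the rate and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=4.0, size=100_000)   # simulated Poisson(4) count data
# For Poisson data the sample mean and sample variance should nearly agree;
# sample variance far above the sample mean would signal overdispersion
# (reach for the negative binomial instead).
mean, var = counts.mean(), counts.var()
```

On real count data, computing these two numbers is a fast first diagnostic before committing to a Poisson model.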
Geometric Distribution
The number of trials until the first success in independent Bernoulli trials. (Convention: $X$ includes the successful trial.)
PMF: $P(X = k) = (1-p)^{k-1} p$ for $k = 1, 2, 3, \dots$
Mean: $\mathbb{E}[X] = \frac{1}{p}$
Variance: $\mathrm{Var}(X) = \frac{1-p}{p^2}$
Key property: The Geometric distribution is the only discrete distribution with the memoryless property: $P(X > m + n \mid X > m) = P(X > n)$.
Multinomial Distribution
The multivariate generalization of the Binomial. In $n$ trials, each of $K$ categories occurs with probability $p_j$, and $X_j$ counts occurrences of category $j$.
PMF: $P(X_1 = x_1, \dots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} \, p_1^{x_1} \cdots p_K^{x_K}$
where $\sum_{j=1}^K x_j = n$ and $\sum_{j=1}^K p_j = 1$.
Marginals: Each $X_j \sim \mathrm{Binomial}(n, p_j)$.
When it arises: Multi-class classification, bag-of-words models, topic models. The softmax output of a neural network parameterizes a single-trial Multinomial (i.e., a Categorical distribution).
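As a sketch of that last point, a softmax output can be used directly as the parameter of a Categorical distribution (the logits here are made up):

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical network outputs
p = softmax(logits)                  # nonnegative, sums to 1: a valid Categorical parameter
rng = np.random.default_rng(0)
label = rng.choice(len(p), p=p)      # one draw = a Multinomial with n = 1
```

Sampling a label like this is exactly one trial of the Multinomial described above.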
Continuous Distributions
Uniform Distribution
Constant density on the interval $[a, b]$.
PDF: $f(x) = \frac{1}{b-a}$ for $a \le x \le b$, zero otherwise.
Mean: $\mathbb{E}[X] = \frac{a+b}{2}$
Variance: $\mathrm{Var}(X) = \frac{(b-a)^2}{12}$
When it arises: Maximum entropy distribution on a bounded interval with no other constraints. Used in random initialization, random search, and as the base distribution for inverse transform sampling.
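Inverse transform sampling from the Uniform base distribution can be sketched for the Exponential (the rate chosen below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)    # base Uniform(0, 1) draws
lam = 2.0                        # target Exponential rate (arbitrary choice)
x = -np.log(1.0 - u) / lam       # inverse CDF: F^{-1}(u) = -log(1 - u) / lam
# The transformed samples follow Exponential(lam); the mean should be near 1/lam.
```

The same recipe works for any distribution whose inverse CDF you can evaluate.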
Gaussian (Normal) Distribution
The most important continuous distribution. Parameterized by mean $\mu$ and variance $\sigma^2$.
PDF: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Mean: $\mathbb{E}[X] = \mu$
Variance: $\mathrm{Var}(X) = \sigma^2$
MGF: $M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$
The normalizing constant $\frac{1}{\sqrt{2\pi\sigma^2}}$ ensures $\int_{-\infty}^{\infty} f(x)\,dx = 1$. This requires the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$.
Multivariate Gaussian
For $\mathbf{x} \in \mathbb{R}^d$ with mean vector $\boldsymbol{\mu}$ and positive definite covariance matrix $\Sigma$:
PDF: $f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
Key properties:
- Marginals are Gaussian: if $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, any subset of coordinates is also multivariate Gaussian.
- For jointly Gaussian vectors, uncorrelated implies independent. Marginal Gaussianity alone does not suffice. A standard counterexample: $X \sim \mathcal{N}(0, 1)$ and $Y = SX$, where $S$ is an independent random sign taking values $\pm 1$ with probability $\frac{1}{2}$ each. Both marginals are Gaussian and the pair is uncorrelated, yet $X$ and $Y$ are dependent (since $|Y| = |X|$).
- Affine transformations preserve Gaussianity: $A\mathbf{x} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b}, A \Sigma A^\top)$.
When it arises: Central limit theorem, Bayesian linear regression posterior, Gaussian processes, VAE latent spaces.
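The uncorrelated-but-dependent construction (a standard normal multiplied by an independent random sign) can be checked by simulation; a sketch with an arbitrary sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)
s = rng.choice([-1.0, 1.0], size=n)   # independent random sign
y = s * x                             # y is also marginally standard normal

corr = np.corrcoef(x, y)[0, 1]        # ~0: the pair is uncorrelated
# Yet |y| equals |x| exactly, so x and y are clearly dependent.
```

The pair (x, y) is not jointly Gaussian, which is why zero correlation fails to imply independence here.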
Exponential Distribution
Models waiting times between events in a Poisson process with rate $\lambda > 0$.
PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$
Mean: $\mathbb{E}[X] = \frac{1}{\lambda}$
Variance: $\mathrm{Var}(X) = \frac{1}{\lambda^2}$
Key property: Memoryless: $P(X > s + t \mid X > s) = P(X > t)$. This is the only continuous distribution with this property.
Relationship: $\mathrm{Exponential}(\lambda) = \mathrm{Gamma}(1, \lambda)$.
Gamma Distribution
Generalizes the Exponential. Shape $\alpha > 0$, rate $\beta > 0$.
PDF: $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ for $x > 0$
where $\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t}\,dt$ is the Gamma function.
Mean: $\mathbb{E}[X] = \frac{\alpha}{\beta}$
Variance: $\mathrm{Var}(X) = \frac{\alpha}{\beta^2}$
Special cases:
- $\mathrm{Gamma}(1, \lambda) = \mathrm{Exponential}(\lambda)$; $\mathrm{Gamma}(k/2, 1/2) = \chi^2_k$
- Sum of $n$ independent $\mathrm{Exponential}(\lambda)$ variables is $\mathrm{Gamma}(n, \lambda)$
When it arises: Conjugate prior for the Poisson rate and the precision (inverse variance) of a Gaussian. Models positive quantities with skew.
Beta Distribution
Defined on $[0, 1]$ with shape parameters $\alpha, \beta > 0$.
PDF: $f(x) = \frac{x^{\alpha - 1} (1-x)^{\beta - 1}}{B(\alpha, \beta)}$ for $0 \le x \le 1$
where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$.
Mean: $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$
Variance: $\mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$
Key fact: The Beta is the conjugate prior for the Bernoulli/Binomial parameter $p$. If $p \sim \mathrm{Beta}(\alpha, \beta)$ and you observe $k$ successes in $n$ trials, then $p \mid \text{data} \sim \mathrm{Beta}(\alpha + k, \beta + n - k)$.
Special cases: $\mathrm{Beta}(1, 1) = \mathrm{Uniform}(0, 1)$.
Chi-Squared Distribution
The sum of $k$ independent squared standard normals. Degrees of freedom $k$.
PDF: $f(x) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$ for $x > 0$
Mean: $\mathbb{E}[X] = k$
Variance: $\mathrm{Var}(X) = 2k$
Relationship: $\chi^2_k = \mathrm{Gamma}(k/2, 1/2)$ in the rate parameterization.
If $Z_1, \dots, Z_k \sim \mathcal{N}(0, 1)$ independently, then $\sum_{i=1}^k Z_i^2 \sim \chi^2_k$.
When it arises: Hypothesis testing (goodness-of-fit, likelihood ratio tests), analysis of variance. The sample variance of Gaussian data is proportional to a $\chi^2_{n-1}$. Important in concentration: the $\chi^2$ is sub-exponential but not sub-Gaussian.
Student-t Distribution
Arises when estimating the mean of a Gaussian population with unknown variance. Degrees of freedom $\nu > 0$.
PDF: $f(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$
Mean: $0$ for $\nu > 1$ (undefined for $\nu \le 1$)
Variance: $\frac{\nu}{\nu - 2}$ for $\nu > 2$
Construction: If $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_\nu$ are independent, then $T = \frac{Z}{\sqrt{V/\nu}} \sim t_\nu$.
Key fact: Heavier tails than the Gaussian. As $\nu \to \infty$, $t_\nu \to \mathcal{N}(0, 1)$. For small $\nu$, the heavy tails accommodate outliers. This is why the Student-t is used in robust statistics.
F Distribution
The ratio of two independent scaled chi-squared variables. Degrees of freedom $d_1, d_2$.
Construction: If $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$ are independent, then $F = \frac{U/d_1}{V/d_2} \sim F(d_1, d_2)$.
Mean: $\frac{d_2}{d_2 - 2}$ for $d_2 > 2$
When it arises: ANOVA, comparing two model fits (F-test for nested models), testing whether a group of regression coefficients is jointly zero.
Relationship: If $T \sim t_\nu$, then $T^2 \sim F(1, \nu)$.
Cauchy Distribution
Location $x_0$, scale $\gamma > 0$.
PDF: $f(x) = \frac{1}{\pi\gamma \left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}$
Mean: Does not exist (the integral diverges).
Variance: Does not exist.
Key fact: The Cauchy is $t_1$, the Student-t with one degree of freedom. It is the standard example of a distribution with no mean. The sample mean of $n$ i.i.d. Cauchy variables has the same distribution as a single observation, so the law of large numbers does not apply. This is why finite moments matter for concentration inequalities.
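The failure of averaging can be seen numerically (a sketch; sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.standard_cauchy(1_000_000)
# The running sample mean never stabilizes: it is itself standard-Cauchy
# distributed at every sample size, so averaging buys nothing.
running_mean = draws.cumsum() / np.arange(1, draws.size + 1)
# The sample median, by contrast, concentrates near the true center 0.
med = np.median(draws)
```

Plotting `running_mean` against sample size shows the characteristic jumps whenever an extreme draw lands.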
Dirichlet Distribution
The multivariate generalization of the Beta, defined on the $(K-1)$-simplex $\left\{\mathbf{p} \in \mathbb{R}^K : p_j \ge 0, \sum_{j=1}^K p_j = 1\right\}$.
PDF: $f(\mathbf{p}) = \frac{\Gamma\left(\sum_{j=1}^K \alpha_j\right)}{\prod_{j=1}^K \Gamma(\alpha_j)} \prod_{j=1}^K p_j^{\alpha_j - 1}$
Marginals: Each $p_j \sim \mathrm{Beta}\left(\alpha_j, \sum_{i \ne j} \alpha_i\right)$.
Mean: $\mathbb{E}[p_j] = \frac{\alpha_j}{\sum_{i=1}^K \alpha_i}$
Key fact: Conjugate prior for the Multinomial/Categorical. If $\mathbf{p} \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$ and $\mathbf{x} \sim \mathrm{Multinomial}(n, \mathbf{p})$ with counts $x_1, \dots, x_K$, then $\mathbf{p} \mid \mathbf{x} \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \dots, \alpha_K + x_K)$.
When it arises: Latent Dirichlet Allocation (LDA), Bayesian multi-class models, any model requiring a prior over probability vectors.
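The conjugate update is a one-liner; a sketch with made-up counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior over 3 categories
counts = np.array([5.0, 3.0, 2.0])  # hypothetical Multinomial counts (n = 10)
post = alpha + counts               # Dirichlet posterior parameters
post_mean = post / post.sum()       # E[p_j | x] = (alpha_j + x_j) / sum_i (alpha_i + x_i)
```

With more observed counts the posterior mean approaches the empirical frequencies, and the prior pseudo-counts matter less.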
Relationships Between Distributions
The major distributions form a web of connections:
- Bernoulli → Binomial: sum of $n$ i.i.d. Bernoullis
- Binomial → Poisson: limit as $n \to \infty$, $p \to 0$, $np \to \lambda$
- Exponential → Gamma: $\mathrm{Exponential}(\lambda) = \mathrm{Gamma}(1, \lambda)$; sum of $n$ Exponentials is $\mathrm{Gamma}(n, \lambda)$
- Gamma → Chi-squared: $\chi^2_k = \mathrm{Gamma}(k/2, 1/2)$
- Gaussian → Chi-squared: sum of $k$ squared standard normals
- Gaussian + Chi-squared → Student-t: ratio construction $T = Z / \sqrt{V/\nu}$
- Chi-squared + Chi-squared → F: ratio of two scaled chi-squareds
- Beta → Bernoulli: conjugate prior relationship
- Beta → Dirichlet: multivariate generalization
- Dirichlet → Multinomial: conjugate prior relationship
- Gaussian + Gaussian prior → Gaussian posterior: self-conjugacy
The distributions of sorted samples from these families (the $k$-th smallest value from $n$ draws) are studied in order statistics, which connects to extreme value theory and nonparametric inference.
Main Theorems
The Gaussian Normalizing Constant
Statement
The Gaussian integral evaluates to:
$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$
Consequently, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ integrates to 1 and is a valid PDF.
Intuition
The $\frac{1}{\sqrt{2\pi\sigma^2}}$ in the Gaussian PDF is not arbitrary. It is the only constant that makes the total probability equal to 1. Without it, the "bell curve" would not be a proper probability distribution.
Proof Sketch
Let $I = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx$. Then:
$I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\,dx\,dy$
Switch to polar coordinates: $x = r\cos\theta$, $y = r\sin\theta$, $dx\,dy = r\,dr\,d\theta$:
$I^2 = \int_0^{2\pi} \int_0^\infty e^{-r^2/2}\, r\,dr\,d\theta = 2\pi \left[-e^{-r^2/2}\right]_0^\infty = 2\pi$
Therefore $I = \sqrt{2\pi}$.
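The result is easy to sanity-check numerically (a crude Riemann sum on a truncated grid; truncation is harmless since $e^{-50} \approx 2 \times 10^{-22}$):

```python
import math
import numpy as np

# Riemann-sum approximation of the Gaussian integral on [-10, 10];
# the integrand outside this range contributes nothing visible.
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
integral = np.exp(-x**2 / 2).sum() * dx
# integral should agree with sqrt(2*pi) to many decimal places
```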
Why It Matters
This is one of the most important calculations in all of probability. The polar coordinates trick is a canonical proof technique. Understanding where $\sqrt{2\pi}$ comes from demystifies the Gaussian and explains why it appears in entropy formulas, the CLT, and information-theoretic quantities.
Failure Mode
Students sometimes assume the Gaussian integrates to 1 "by definition" without understanding the calculation. This leads to confusion when computing marginals of multivariate Gaussians or when normalizing constants matter (e.g., in Bayesian inference and partition functions).
Canonical Examples
Beta-Bernoulli conjugacy in action
Suppose you observe $k$ heads in $n$ coin flips. With a $\mathrm{Beta}(1, 1)$ prior (uniform on $[0, 1]$), the posterior is $\mathrm{Beta}(k + 1, n - k + 1)$.
The posterior mean is $\frac{k+1}{n+2}$, slightly shrunk from the MLE of $\frac{k}{n}$ toward $\frac{1}{2}$. The posterior mode is $\frac{k}{n}$, which equals the MLE. With a stronger prior $\mathrm{Beta}(\alpha, \alpha)$ for $\alpha > 1$, the posterior would be $\mathrm{Beta}(k + \alpha, n - k + \alpha)$ with mean $\frac{k + \alpha}{n + 2\alpha}$, with more shrinkage toward the prior mean of $\frac{1}{2}$.
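The update can be written out in a few lines (a sketch; the counts below are made up for illustration):

```python
k, n = 7, 10                         # hypothetical data: 7 heads in 10 flips
a, b = 1, 1                          # Beta(1, 1) uniform prior
post_a, post_b = a + k, b + n - k    # conjugate update: Beta(8, 4)
post_mean = post_a / (post_a + post_b)            # shrunk toward 1/2
post_mode = (post_a - 1) / (post_a + post_b - 2)  # equals the MLE under a uniform prior
mle = k / n
```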
Chi-squared as sum of squared normals
If $Z_1, \dots, Z_k \sim \mathcal{N}(0, 1)$ independently, then $X = \sum_{i=1}^k Z_i^2 \sim \chi^2_k$ with $\mathbb{E}[X] = k$ and $\mathrm{Var}(X) = 2k$.
Note: each $Z_i^2$ has mean 1 and variance 2. The $\chi^2$ is sub-exponential but not sub-Gaussian because $Z_i^2$ is a product of two sub-Gaussian variables (and products of sub-Gaussians are sub-exponential).
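A direct simulation of this construction (a sketch; the degrees of freedom and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 5, 200_000
z = rng.standard_normal((n, k))   # n rows of k independent standard normals
x = (z ** 2).sum(axis=1)          # each row-sum is one chi-squared(k) draw
# Theory: E[X] = k and Var(X) = 2k.
```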
Common Confusions
The normalizing constant matters
Students sometimes write the Gaussian PDF without the normalizing constant, or use $\frac{1}{\sqrt{2\pi}}$ (forgetting the $\sigma$). The full constant is $\frac{1}{\sqrt{2\pi\sigma^2}} = \frac{1}{\sigma\sqrt{2\pi}}$. Getting this wrong means your PDF does not integrate to 1, your log-likelihoods are wrong, and your MLE derivations break. In the multivariate case, the constant involves $(2\pi)^{d/2}$ and $|\Sigma|^{1/2}$.
Gamma parameterization varies between sources
Some sources use the rate parameterization with PDF proportional to $x^{\alpha - 1} e^{-\beta x}$, while others use the scale parameterization with PDF proportional to $x^{\alpha - 1} e^{-x/\theta}$ where $\theta = 1/\beta$. Always check which convention a textbook or library uses. NumPy and SciPy use the scale parameterization. This page uses rate.
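A quick check of the conversion, using NumPy's sampler (which takes shape and scale; the parameter values below are arbitrary):

```python
import numpy as np

alpha, beta = 3.0, 2.0               # shape and RATE (this page's convention)
rng = np.random.default_rng(0)
# NumPy parameterizes the Gamma by shape and SCALE, so pass scale = 1/rate.
x = rng.gamma(shape=alpha, scale=1.0 / beta, size=500_000)
# Under the rate convention: mean = alpha/beta = 1.5, variance = alpha/beta**2 = 0.75.
```

Passing `beta` directly as the second argument is the classic mistake this confusion produces; the sample mean would come out as $\alpha\beta$ instead of $\alpha/\beta$.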
The Cauchy has no mean, not a mean of zero
The standard Cauchy distribution is symmetric about zero, so you might think its mean is zero. It is not. The mean does not exist because $\int_{-\infty}^{\infty} \frac{|x|}{\pi(1 + x^2)}\,dx = \infty$. The "center" of the Cauchy is the median (and mode), which is $x_0$ (zero for the standard Cauchy), but the expectation is undefined.
Summary
- Bernoulli/Binomial/Multinomial: discrete counts; cross-entropy loss comes from Bernoulli likelihood
- Poisson: counts with mean = variance; approximates Binomial for rare events
- Gaussian: the universal limit (CLT); maximizes entropy for given mean and variance
- Exponential/Gamma: positive continuous; memoryless property for Exponential
- Beta/Dirichlet: priors on probabilities; conjugate to Bernoulli/Multinomial
- Chi-squared: sum of squared normals; appears in hypothesis tests and variance estimates
- Student-t: Gaussian with unknown variance; heavier tails accommodate outliers
- Cauchy: the pathological case; no moments, LLN fails
Exercises
Problem
If $X \sim \mathrm{Gamma}(\alpha_1, \beta)$ and $Y \sim \mathrm{Gamma}(\alpha_2, \beta)$ independently, what is the distribution of $X + Y$? What about $\frac{X}{X + Y}$?
Problem
Show that $\mathrm{Beta}(1, 1) = \mathrm{Uniform}(0, 1)$ by writing out the Beta PDF with $\alpha = \beta = 1$.
Problem
Show that if $Z_1 \sim \mathcal{N}(0, 1)$ and $Z_2 \sim \mathcal{N}(0, 1)$ are independent, then $Z_1 / Z_2 \sim \mathrm{Cauchy}(0, 1)$.
References
Canonical:
- Casella & Berger, Statistical Inference (2002), Chapters 3-4
- DeGroot & Schervish, Probability and Statistics (2012), Chapters 5-6
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 2
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 2
- Durrett, Probability: Theory and Examples, 5th ed. (2019), Chapters 2-3
- Billingsley, Probability and Measure, 3rd ed. (1995), Chapters 20-21
Next Topics
Building on these distribution foundations:
- Concentration inequalities: bounding tail probabilities of sums
- Common inequalities: the algebraic tools that connect distributions to bounds
- Empirical risk minimization: where distributional assumptions meet learning
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Anomaly Detection (Layer 2)
- Auction Theory (Layer 3)
- Base Rate Fallacy (Layer 1)
- Bayesian Estimation (Layer 0B)
- Bayesian State Estimation (Layer 2)
- Benford's Law (Layer 1)
- Birthday Paradox (Layer 0A)
- Boltzmann Machines and Hopfield Networks (Layer 2)
- Bootstrap Methods (Layer 2)
- Causal Inference and the Ladder of Causation (Layer 3)
- Central Limit Theorem (Layer 0B)
- Concentration Inequalities (Layer 1)
- Confusion Matrices and Classification Metrics (Layer 1)
- Copulas (Layer 3)
- Data Preprocessing and Feature Engineering (Layer 1)
- Decision Theory Foundations (Layer 2)
- Differential Privacy (Layer 3)
- Dropout (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Expected Utility Theory (Layer 2)
- Extreme Value Theory (Layer 3)
- Fast Fourier Transform (Layer 1)
- Fat Tails and Heavy-Tailed Distributions (Layer 2)
- Game Theory Foundations (Layer 2)
- Goodness-of-Fit Tests (Layer 1)
- Cryptographic Hash Functions (Layer 2)
- Importance Sampling (Layer 2)
- Information Retrieval Foundations (Layer 2)
- Joint, Marginal, and Conditional Distributions (Layer 0A)
- K-Means Clustering (Layer 1)
- Kalman Filter (Layer 2)
- Kelly Criterion (Layer 2)
- KL Divergence (Layer 1)
- Law of Large Numbers (Layer 0B)
- Markov Chains and Steady State (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Method of Moments (Layer 0B)
- Metropolis-Hastings Algorithm (Layer 2)
- Monty Hall Problem (Layer 0A)
- Multi-Armed Bandits Theory (Layer 2)
- Neyman-Pearson and Hypothesis Testing Theory (Layer 2)
- Nonresponse and Missing Data (Layer 2)
- Normalizing Flows (Layer 3)
- Order Statistics (Layer 1)
- Prospect Theory (Layer 3)
- Public-Key Cryptography (Layer 2)
- Sample Size Determination (Layer 2)
- Signal Detection Theory (Layer 2)
- Skewness, Kurtosis, and Higher Moments (Layer 1)
- Survey Sampling Methods (Layer 2)
- Synthetic Data Generation (Layer 3)
- Wasserstein Distances (Layer 4)