Foundations
Beta Distribution
The Beta distribution as the conjugate prior for Bernoulli and Binomial likelihoods, as the order statistic of i.i.d. Uniforms, and as a flexible density on the unit interval: density, moments, conjugacy derivation, and MLE without closed form.
Why This Matters
The Beta distribution is the standard two-parameter family of densities on the unit interval. Two reasons it earns its own page rather than a single line in a survey:
- It is the conjugate prior for any likelihood of the form $\theta^s(1-\theta)^{n-s}$ ("$s$ successes in $n$ trials"). Bernoulli, Binomial, and Negative Binomial all admit a Beta posterior. The update is among the cleanest in Bayesian statistics: add the number of successes to one shape and the number of failures to the other.
- The order statistics of an i.i.d. sample from $\mathrm{Uniform}(0,1)$ are Beta distributed. The $k$-th smallest value out of $n$ uniforms is $\mathrm{Beta}(k, n-k+1)$. This geometric construction predates the Bayesian interpretation by several decades and is the cleanest way to see why the density has its specific shape.
The Beta is also the marginal of any pair of components of a Dirichlet random vector. The Dirichlet generalizes Beta to the simplex of probability vectors; Beta is the two-category special case.
Definition
Beta Distribution
A random variable $X$ has a Beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$, written $X \sim \mathrm{Beta}(\alpha, \beta)$, if its density is
$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \qquad 0 < x < 1,$$
where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the Beta function.
The density is supported on the open unit interval. The two shape parameters control the location and concentration of the mass: large $\alpha$ relative to $\beta$ pulls the mass toward $1$; large $\beta$ relative to $\alpha$ pulls it toward $0$; large $\alpha + \beta$ concentrates the mass.
The case $\alpha = \beta = 1$ is the Uniform(0,1) distribution. The case $\alpha = \beta < 1$ is a U-shaped density with mass concentrated at both endpoints; the case $\alpha = \beta > 1$ is symmetric and unimodal at $1/2$. Asymmetric shape parameters give skewed densities, with the mode at $\frac{\alpha-1}{\alpha+\beta-2}$ when both shapes exceed one.
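As a quick numerical check (a sketch using scipy.stats.beta, assuming SciPy is available), the grid argmax of each density agrees with the closed-form mode when both shapes exceed one:

```python
import numpy as np
from scipy.stats import beta

# Evaluate a few Beta densities to see how the shape parameters act.
x = np.linspace(0.001, 0.999, 999)

for a, b in [(1, 1), (0.5, 0.5), (5, 5), (2, 8)]:
    pdf = beta.pdf(x, a, b)
    print(f"Beta({a},{b}): argmax of density grid = {x[np.argmax(pdf)]:.3f}")

# For alpha, beta > 1 the mode is (alpha - 1) / (alpha + beta - 2).
a, b = 2, 8
print("closed-form mode:", (a - 1) / (a + b - 2))  # 0.125
```

The (0.5, 0.5) case has no interior mode (the grid argmax lands at a boundary point), which is exactly the U-shaped regime described above.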
Density and Moments
Beta Mean and Variance
Statement
For $X \sim \mathrm{Beta}(\alpha, \beta)$,
$$\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}, \qquad \operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$
More generally, $\mathbb{E}[X^k] = \prod_{j=0}^{k-1} \frac{\alpha+j}{\alpha+\beta+j}$ for every positive integer $k$.
Intuition
The mean depends only on the ratio of shapes, but the variance shrinks as $\alpha + \beta$ grows. Two Beta densities with the same mean can have very different variances; the sum $\alpha + \beta$ is the concentration parameter.
Proof Sketch
Direct computation: $\mathbb{E}[X^k] = \frac{B(\alpha+k, \beta)}{B(\alpha, \beta)}$. For $k = 1$, $B(\alpha+1, \beta) = \frac{\alpha}{\alpha+\beta}\, B(\alpha, \beta)$, giving the mean. The variance follows by computing $\mathbb{E}[X^2] = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}$ and subtracting the squared mean.
Why It Matters
The two parameters together control "where the mass is" (through the mean $\frac{\alpha}{\alpha+\beta}$) and "how concentrated it is" (through $\alpha + \beta$). In Bayesian inference, $\alpha + \beta$ acts as a "pseudo-count" of prior observations; the larger it is, the harder it is for new data to move the posterior.
Failure Mode
For $\alpha < 1$ or $\beta < 1$ the density is unbounded at $0$ or $1$, although it remains integrable. Numerical evaluation of the density near the boundaries requires care; use log-density representations to avoid underflow.
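A minimal illustration of the underflow issue, assuming SciPy's `beta.pdf`/`beta.logpdf`: deep in a tail the density underflows to zero in double precision while the log-density stays finite and usable in likelihood sums:

```python
import numpy as np
from scipy.stats import beta

a, b = 5.0, 5.0
x = 1e-300  # deep in the left tail

print(beta.pdf(x, a, b))     # underflows to 0.0 in double precision
print(beta.logpdf(x, a, b))  # finite log-density, safe to accumulate
```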
Beta as a Uniform Order Statistic
Order Statistics of Uniforms Are Beta
Statement
Let $U_1, \dots, U_n$ be i.i.d. $\mathrm{Uniform}(0,1)$ and let $U_{(k)}$ denote the $k$-th smallest value. Then $U_{(k)} \sim \mathrm{Beta}(k, n-k+1)$.
Intuition
For $U_{(k)}$ to lie in a small interval near $x$, we need $k-1$ uniforms to fall below it, the order statistic itself to fall near $x$, and $n-k$ uniforms to fall above it. The density at $x$ is the multinomial count $\frac{n!}{(k-1)!\,1!\,(n-k)!}$ times $x^{k-1}(1-x)^{n-k}$ times the density of a single uniform near $x$, which simplifies to the Beta density with $\alpha = k$ and $\beta = n-k+1$.
Proof Sketch
The joint density of $(U_{(1)}, \dots, U_{(n)})$ is $n!$ on the ordered region $0 < u_1 < \cdots < u_n < 1$. To compute the marginal density of $U_{(k)}$, fix $U_{(k)} = x$ and integrate over the other coordinates. The lower $k-1$ uniforms lie in $(0, x)$ (ordered volume $\frac{x^{k-1}}{(k-1)!}$), and the upper $n-k$ lie in $(x, 1)$ (ordered volume $\frac{(1-x)^{n-k}}{(n-k)!}$). Combining:
$$f_{U_{(k)}}(x) = \frac{n!}{(k-1)!\,(n-k)!}\, x^{k-1}(1-x)^{n-k}.$$
The constant equals $\frac{1}{B(k,\, n-k+1)}$ by the relationship between the Beta function and binomial coefficients. So $U_{(k)} \sim \mathrm{Beta}(k, n-k+1)$.
Why It Matters
This gives a purely geometric origin for the Beta family that does not depend on Bayesian thinking. The Beta is the natural distribution of "the position of the $k$-th of $n$ uniformly scattered points on the unit interval". The conjugate-prior role for the Bernoulli falls out from the same combinatorial structure: the posterior of $\theta$ given $k$ successes in $n$ trials is the predictive distribution of the next ranked position.
Failure Mode
The result requires i.i.d. uniforms on $(0,1)$. Order statistics of non-uniform samples are not Beta, although they can be transformed to Beta by the probability integral transform. Order statistics of dependent samples are not even close to Beta in general.
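The statement can be checked by simulation (a NumPy sketch; the seed and sample size are arbitrary). The mean of $\mathrm{Beta}(k, n-k+1)$ is $k/(n+1)$, which the empirical mean of the $k$-th order statistic should match:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 3  # 3rd smallest of 10 uniforms

# Sort each row of uniforms and pick out the k-th order statistic.
samples = np.sort(rng.uniform(size=(100_000, n)), axis=1)[:, k - 1]

# Theory: U_(k) ~ Beta(k, n-k+1), so its mean is k/(n+1).
print("empirical mean  :", samples.mean())
print("theoretical mean:", k / (n + 1))  # 3/11 ≈ 0.2727
```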
Conjugate Prior for the Bernoulli and Binomial
Beta-Bernoulli Conjugacy
Statement
Let $X_1, \dots, X_n$ be i.i.d. $\mathrm{Bernoulli}(\theta)$, let $s = \sum_{i=1}^n X_i$ be the total successes, and let the prior be $\theta \sim \mathrm{Beta}(\alpha_0, \beta_0)$. Then
$$\theta \mid x_{1:n} \sim \mathrm{Beta}(\alpha_0 + s,\ \beta_0 + n - s).$$
For a $\mathrm{Binomial}(n, \theta)$ observation of $s$ as a single sufficient summary, the same posterior holds.
Intuition
A $\mathrm{Beta}(\alpha_0, \beta_0)$ prior contributes $\alpha_0$ pseudo-successes and $\beta_0$ pseudo-failures. Real data adds $s$ real successes and $n - s$ real failures. The posterior is Beta with shape parameters equal to pseudo-successes plus real successes, and pseudo-failures plus real failures.
Proof Sketch
The Bernoulli likelihood is $\theta^s (1-\theta)^{n-s}$. The Beta prior density is proportional to $\theta^{\alpha_0 - 1}(1-\theta)^{\beta_0 - 1}$. The posterior is proportional to the product:
$$\theta^{\alpha_0 + s - 1}(1-\theta)^{\beta_0 + n - s - 1},$$
which is the kernel of $\mathrm{Beta}(\alpha_0 + s,\ \beta_0 + n - s)$.
Why It Matters
The posterior mean is $\frac{\alpha_0 + s}{\alpha_0 + \beta_0 + n}$, a precision-weighted blend of the prior mean $\frac{\alpha_0}{\alpha_0 + \beta_0}$ and the MLE $\hat\theta = s/n$. The blend weight on the MLE is $\frac{n}{\alpha_0 + \beta_0 + n}$, which approaches one as data accumulates. The Jeffreys prior for the Bernoulli is $\mathrm{Beta}(1/2, 1/2)$; the uniform prior is $\mathrm{Beta}(1, 1)$; the Haldane prior is $\mathrm{Beta}(0, 0)$ (improper). All three are admissible starting points with different bias-variance trade-offs.
Failure Mode
Conjugacy is a property of the model, not the data. With Bernoulli data and a Beta prior, the posterior is Beta. Replace the Bernoulli with a probit or any other binary likelihood that is not Bernoulli (different link, different noise) and the Beta is no longer conjugate; the posterior must be computed by integration or sampling.
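The update rule in the statement can be sketched in a few lines (the helper name `beta_bernoulli_update` is illustrative, not from any library); the numbers match the A/B-test exercise below:

```python
def beta_bernoulli_update(alpha0, beta0, successes, n):
    """Posterior shape parameters after `successes` in `n` Bernoulli trials."""
    return alpha0 + successes, beta0 + (n - successes)

# Uniform Beta(1, 1) prior, 36 conversions out of 200 users.
a_post, b_post = beta_bernoulli_update(1, 1, 36, 200)
print(a_post, b_post)                                  # 37 165
print("posterior mean:", a_post / (a_post + b_post))   # 37/202 ≈ 0.1832
print("MLE           :", 36 / 200)                     # 0.18
```

The posterior mean sits slightly above the MLE because the uniform prior contributes one pseudo-success and one pseudo-failure.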
Connection to the Gamma Distribution
The Beta arises naturally from ratios of independent Gammas. If $X \sim \mathrm{Gamma}(\alpha, 1)$ and $Y \sim \mathrm{Gamma}(\beta, 1)$ are independent, then
$$\frac{X}{X + Y} \sim \mathrm{Beta}(\alpha, \beta),$$
and this ratio is independent of $X + Y$. This identity is the reason the Beta function has the form it does: $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the Jacobian-adjusted ratio of Gamma normalizing constants. See gamma distribution for the Gamma side.
The same identity gives a simple way to sample from the Beta: draw and from independent Gammas (which are sums of independent Exponentials when shapes are integers) and compute the ratio.
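A minimal sampling sketch using NumPy's Gamma generator (the parameter values and seed are arbitrary); the empirical moments should match the Beta formulas:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 5.0

# Beta(a, b) via the Gamma ratio X/(X+Y) with X ~ Gamma(a), Y ~ Gamma(b).
x = rng.gamma(a, size=100_000)
y = rng.gamma(b, size=100_000)
z = x / (x + y)

print("empirical mean:", z.mean())  # ~ a/(a+b) = 2/7 ≈ 0.2857
print("empirical var :", z.var())   # ~ ab/((a+b)^2 (a+b+1)) = 10/392
```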
Maximum Likelihood Estimation
The MLE of $(\alpha, \beta)$ from an i.i.d. Beta sample $x_1, \dots, x_n$ has no closed form. The log-likelihood is
$$\ell(\alpha, \beta) = n\left[(\alpha - 1)\,\overline{\log x} + (\beta - 1)\,\overline{\log(1-x)} - \log B(\alpha, \beta)\right],$$
and the score equations are
$$\psi(\alpha) - \psi(\alpha + \beta) = \overline{\log x}, \qquad \psi(\beta) - \psi(\alpha + \beta) = \overline{\log(1-x)},$$
where $\psi$ is the digamma function and the overlines denote sample averages. These must be solved numerically (e.g., by Newton's method, with a starting point from the method-of-moments estimator). The per-observation Fisher information matrix is
$$\mathcal{I}(\alpha, \beta) = \begin{pmatrix} \psi'(\alpha) - \psi'(\alpha+\beta) & -\psi'(\alpha+\beta) \\ -\psi'(\alpha+\beta) & \psi'(\beta) - \psi'(\alpha+\beta) \end{pmatrix}.$$
In Bayesian workflows the MLE is rarely used; the posterior is typically the object of interest, and the Beta-Bernoulli conjugacy gives the posterior in closed form regardless of how the prior was chosen.
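One way to solve the score equations is a hand-rolled Newton iteration with SciPy's digamma and trigamma functions (a sketch; the function name `beta_mle`, the tolerances, and the simulated data are illustrative):

```python
import numpy as np
from scipy.special import digamma, polygamma

def beta_mle(x, tol=1e-10, max_iter=100):
    """Newton iteration for the Beta MLE, started from method-of-moments."""
    lx, l1x = np.mean(np.log(x)), np.mean(np.log1p(-x))
    m, v = x.mean(), x.var()
    c = m * (1 - m) / v - 1                 # moment estimate of alpha + beta
    theta = np.array([m * c, (1 - m) * c])  # (alpha, beta) starting point
    for _ in range(max_iter):
        a, b = theta
        # Residuals of the score equations psi(a) - psi(a+b) = mean log x, etc.
        g = np.array([digamma(a) - digamma(a + b) - lx,
                      digamma(b) - digamma(a + b) - l1x])
        t = polygamma(1, a + b)
        # Jacobian of the residuals (the per-observation Fisher information).
        H = np.array([[polygamma(1, a) - t, -t],
                      [-t, polygamma(1, b) - t]])
        step = np.linalg.solve(H, g)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

rng = np.random.default_rng(2)
data = rng.beta(2.0, 5.0, size=50_000)
print(beta_mle(data))  # close to (2, 5)
```

With a large sample and the moment start, the iteration converges in a handful of steps; in ill-conditioned cases a damped step or a bounded optimizer is safer.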
Method of Moments (Closed Form)
The method-of-moments estimator has a closed form. With the sample mean $\bar{x}$ and the sample variance $s^2$,
$$\hat\alpha = \bar{x}\left(\frac{\bar{x}(1-\bar{x})}{s^2} - 1\right), \qquad \hat\beta = (1 - \bar{x})\left(\frac{\bar{x}(1-\bar{x})}{s^2} - 1\right).$$
The estimator is consistent and almost always reasonable as a starting point for MLE iteration. See method of moments for the general framework.
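The closed form translates directly to code (a NumPy sketch; the helper name and simulated data are illustrative):

```python
import numpy as np

def beta_method_of_moments(x):
    """Closed-form moment estimates of (alpha, beta) from data on (0, 1)."""
    m, v = x.mean(), x.var()
    c = m * (1 - m) / v - 1  # estimated concentration alpha + beta
    return m * c, (1 - m) * c

rng = np.random.default_rng(3)
data = rng.beta(3.0, 7.0, size=100_000)
print(beta_method_of_moments(data))  # close to (3, 7)
```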
Common Confusions
Pseudo-counts versus real counts
A $\mathrm{Beta}(\alpha_0, \beta_0)$ prior is sometimes described as "equivalent to having seen $\alpha_0$ successes and $\beta_0$ failures". This intuition is right in expectation but wrong in variance: the prior also carries a concentration $\alpha_0 + \beta_0$ that real data cannot override unless the sample is at least comparably large. Conjugate priors compress prior information into a sufficient statistic, but they do not replace it with imaginary observations.
Beta(1, 1) is the uniform distribution
The Beta(1, 1) density is $f(x) = 1$ for $0 < x < 1$, which is the Uniform(0, 1) density. The Uniform is a special case of the Beta, and the Beta is a two-parameter generalization of the Uniform. Some Bayesian software libraries default to a Beta(1, 1) prior; this is the noninformative-uniform-prior choice, not the Jeffreys prior $\mathrm{Beta}(1/2, 1/2)$.
The mode is not the mean
For $\alpha, \beta > 1$ the mode is $\frac{\alpha - 1}{\alpha + \beta - 2}$ and the mean is $\frac{\alpha}{\alpha + \beta}$. They coincide only when $\alpha = \beta$ (symmetric Beta). Bayesian MAP estimates report the mode; posterior-mean estimates report the mean. The two disagree under asymmetric priors and small sample sizes.
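A quick numeric illustration of the gap, assuming SciPy ($\mathrm{Beta}(2, 8)$ is an arbitrary skewed example):

```python
from scipy.stats import beta

a, b = 2.0, 8.0
mean = a / (a + b)            # 0.2   -- what a posterior-mean estimate reports
mode = (a - 1) / (a + b - 2)  # 0.125 -- what a MAP estimate reports
print(mean, mode, beta.median(a, b))  # the median lies between them
```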
Exercises
Problem
Let $X \sim \mathrm{Beta}(\alpha, \beta)$. Compute $\mathbb{E}[X]$, $\operatorname{Var}(X)$, and the mode.
Problem
An A/B test starts with a Beta(1, 1) prior on the conversion probability . After serving 200 users, 36 convert. Find the posterior distribution and the posterior mean, and compare to the MLE.
Problem
Let $U_1, \dots, U_n$ be i.i.d. $\mathrm{Uniform}(0,1)$. Identify the distribution of the median $U_{((n+1)/2)}$ (with $n$ odd) and the third-largest value $U_{(n-2)}$.
Problem
Let $X \sim \mathrm{Gamma}(\alpha, 1)$ and $Y \sim \mathrm{Gamma}(\beta, 1)$ be independent. Show that $\frac{X}{X+Y} \sim \mathrm{Beta}(\alpha, \beta)$, independent of $X + Y$.
References
Canonical:
- Casella and Berger, Statistical Inference (2002), Chapter 3 (Section 3.3 on Beta), Chapter 5 (Section 5.4 on order statistics).
- Lehmann and Casella, Theory of Point Estimation (1998), Chapter 4 (conjugate priors).
- Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 1 (conjugate families).
Bayesian framing:
- Gelman et al., Bayesian Data Analysis (2013), Chapter 2 (Section 2.4 on the Beta-Binomial model).
- Robert, The Bayesian Choice (2007), Chapter 3.
- Jaynes, Probability Theory: The Logic of Science (2003), Chapter 6 (uniform and Jeffreys priors for the Bernoulli).
Order statistics:
- David and Nagaraja, Order Statistics (2003), Chapter 2 (uniform order statistics and the Beta distribution).
Last reviewed: May 11, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Probability Distributions (layer 0A · tier 1)
- Distributions Atlas (layer 0A · tier 1)
- Gamma Distribution (layer 0A · tier 1)
Derived topics
No published topic currently declares this as a prerequisite.