

Beta Distribution

The Beta distribution as the conjugate prior for Bernoulli and Binomial likelihoods, as the distribution of uniform order statistics, and as a flexible density on the unit interval: density, moments, the conjugacy derivation, and the MLE, which has no closed form.

Core · Tier 1 · Stable · Core spine · ~50 min

Why This Matters

The Beta distribution is the standard two-parameter family of densities on the unit interval. Two reasons it earns its own page rather than a single line in a survey:

  1. It is the conjugate prior for any likelihood of the form "$k$ successes in $n$ trials". Bernoulli, Binomial, and Negative Binomial all admit a Beta posterior. The update is among the cleanest in Bayesian statistics: add the number of successes to one shape and the number of failures to the other.
  2. The order statistics of an i.i.d. sample from $\operatorname{Unif}(0,1)$ are Beta distributed. The $k$-th smallest value out of $n$ uniforms is $\operatorname{Beta}(k,\,n-k+1)$. This geometric construction predates the Bayesian interpretation by several decades and is the cleanest way to see why the density has its specific shape.

The Beta is also the marginal distribution of each component of a Dirichlet random vector. The Dirichlet generalizes the Beta to the simplex of probability vectors; the Beta is the two-category special case.

Definition

Definition

Beta Distribution

A random variable $X$ has a Beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$ if its density is

$$f_X(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \qquad 0 < x < 1,$$

where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the Beta function.

The density is supported on the open unit interval. The two shape parameters control the location and concentration of the mass: large $\alpha$ relative to $\beta$ pulls the mass toward $1$; large $\beta$ relative to $\alpha$ pulls it toward $0$; large $\alpha + \beta$ concentrates the mass.

The case $\alpha = \beta = 1$ is the Uniform(0, 1) distribution. The case $\alpha = \beta < 1$ is a U-shaped density with mass concentrated at both endpoints; the case $\alpha = \beta > 1$ is symmetric and unimodal at $1/2$. Asymmetric shape parameters give skewed densities, with the mode at $(\alpha-1)/(\alpha+\beta-2)$ when both shapes exceed one.
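To see these regimes concretely, here is a small sketch using `scipy.stats.beta`, which uses the same parameterization as the density above; the evaluation grid and the particular shape pairs are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import beta

# Evaluate the density on a coarse grid for several shape regimes.
x = np.linspace(0.05, 0.95, 5)
for a, b in [(1, 1), (0.5, 0.5), (5, 5), (2, 8), (8, 2)]:
    print(f"Beta({a}, {b}):", np.round(beta(a, b).pdf(x), 3))

# For a, b > 1 the mode is (a - 1) / (a + b - 2).
a, b = 2, 8
print("mode of Beta(2, 8):", (a - 1) / (a + b - 2))  # 0.125: mass pulled toward 0
```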

Density and Moments

Proposition

Beta Mean and Variance

Statement

For $X\sim\operatorname{Beta}(\alpha,\beta)$,

$$\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta},\qquad \operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

More generally, $\mathbb{E}[X^k] = B(\alpha+k,\beta)/B(\alpha,\beta)$ for every positive integer $k$.

Intuition

The mean depends only on the ratio of shapes, but the variance shrinks as $\alpha + \beta$ grows. Two Beta densities with the same mean can have very different variances; the sum $\alpha + \beta$ is the concentration parameter.

Proof Sketch

Direct computation:

$$\mathbb{E}[X^k] = \int_0^1 x^k\,\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,dx = \frac{B(\alpha+k,\beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+k)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+k)\,\Gamma(\alpha)}.$$

For $k = 1$, $\Gamma(\alpha+1)/\Gamma(\alpha) = \alpha$ and $\Gamma(\alpha+\beta)/\Gamma(\alpha+\beta+1) = 1/(\alpha+\beta)$, giving the mean. The variance follows by computing $\mathbb{E}[X^2]$ and subtracting the squared mean.
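As a quick numerical sanity check of the moment identity (a sketch; the shape values and $k$ are arbitrary), compare the closed form against direct quadrature:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as B
from scipy.stats import beta as beta_dist

alpha, b, k = 3.0, 2.0, 4
closed_form = B(alpha + k, b) / B(alpha, b)
numeric, _ = quad(lambda x: x**k * beta_dist(alpha, b).pdf(x), 0, 1)
print(closed_form, numeric)  # both approximately 0.2143
```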

Why It Matters

The two parameters together control "where the mass is" (through the mean) and "how concentrated it is" (through $\alpha+\beta$). In Bayesian inference, $\alpha + \beta$ acts as a "pseudo-count" of prior observations; the larger it is, the harder it is for new data to move the posterior.

Failure Mode

For $\alpha < 1$ or $\beta < 1$ the density is unbounded at $0$ or $1$, although it remains integrable. Numerical evaluation of the density near the boundaries requires care; use log-density representations to avoid overflow at the poles and underflow in the tails.
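A small illustration of both issues, assuming SciPy's `scipy.stats.beta` (the shapes and evaluation points are arbitrary, and the printed values are approximate):

```python
from scipy.stats import beta

spiky = beta(0.5, 0.5)        # density unbounded at both endpoints
print(spiky.pdf(1e-300))      # enormous but still finite: ~3.2e149
print(spiky.logpdf(1e-300))   # ~344.2, computed stably on the log scale

tight = beta(500, 500)        # sharply concentrated near 1/2
print(tight.pdf(0.01))        # underflows to 0.0 in double precision
print(tight.logpdf(0.01))     # a finite, very negative number
```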

Beta as a Uniform Order Statistic

Theorem

Order Statistics of Uniforms Are Beta

Statement

Let $U_1,\dots,U_n$ be i.i.d. $\operatorname{Unif}(0,1)$ and let $U_{(k)}$ denote the $k$-th smallest value. Then

$$U_{(k)}\sim\operatorname{Beta}(k,\,n-k+1).$$

Intuition

For $U_{(k)}$ to lie in a small interval near $x$, we need $k-1$ uniforms to fall below it, the order statistic itself to fall near $x$, and $n-k$ uniforms to fall above it. The density at $x$ is $\binom{n}{k-1,\,1,\,n-k}\,x^{k-1}(1-x)^{n-k}$ times the density of a single uniform near $x$, which simplifies to the Beta density with $\alpha = k$ and $\beta = n - k + 1$.

Proof Sketch

The joint density of $U_{(1)},\dots,U_{(n)}$ is $n!\cdot\mathbf{1}\{0 \le u_1 \le\cdots\le u_n\le 1\}$. To compute the marginal density of $U_{(k)}$, fix $u_{(k)} = x$ and integrate over the other coordinates. The lower $k-1$ uniforms lie in $[0, x]$ (volume $x^{k-1}/(k-1)!$), and the upper $n-k$ lie in $[x, 1]$ (volume $(1-x)^{n-k}/(n-k)!$). Combining:

$$f_{U_{(k)}}(x) = n!\cdot\frac{x^{k-1}}{(k-1)!}\cdot\frac{(1-x)^{n-k}}{(n-k)!} = \frac{n!}{(k-1)!\,(n-k)!}\,x^{k-1}(1-x)^{n-k}.$$

The constant is the right normalizer because $B(k,\,n-k+1) = \Gamma(k)\Gamma(n-k+1)/\Gamma(n+1) = (k-1)!\,(n-k)!/n!$ for integer arguments. So $U_{(k)}\sim\operatorname{Beta}(k,\,n-k+1)$.

Why It Matters

This gives a purely geometric origin for the Beta family that does not depend on Bayesian thinking. The Beta is the natural distribution of "the position of the $k$-th of $n$ uniformly scattered points on the unit interval". The conjugate-prior role for the Bernoulli falls out of the same combinatorial structure: under a uniform prior, the posterior of $p$ given $k$ successes in $n$ trials is $\operatorname{Beta}(k+1,\,n-k+1)$, the law of the $(k+1)$-th smallest of $n+1$ uniforms.

Failure Mode

The result requires i.i.d. uniforms on $[0, 1]$. Order statistics of non-uniform samples are not Beta, although they can be transformed to Beta by the probability integral transform. Order statistics of dependent samples are not even close to Beta in general.
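A Monte Carlo check of the theorem (a sketch; $n$, $k$, the seed, and the replication count are arbitrary):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
n, k, reps = 10, 3, 200_000
# Sort each row of uniforms and take the k-th smallest.
u_k = np.sort(rng.uniform(size=(reps, n)), axis=1)[:, k - 1]

target = beta(k, n - k + 1)
print(u_k.mean(), target.mean())  # both near k / (n + 1) = 3/11 ≈ 0.273
print(u_k.var(), target.var())
```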

Conjugate Prior for the Bernoulli and Binomial

Theorem

Beta-Bernoulli Conjugacy

Statement

Let $X_1,\dots,X_n\sim\operatorname{Bern}(p)$ be i.i.d., let $S = \sum X_i$ be the total number of successes, and let the prior be $p\sim\operatorname{Beta}(\alpha_0,\beta_0)$. Then

$$p\mid X_1,\dots,X_n\sim\operatorname{Beta}(\alpha_0 + S,\ \beta_0 + n - S).$$

The same posterior holds when only the sufficient summary $S\sim\operatorname{Bin}(n,p)$ is observed.

Intuition

A Beta prior contributes $\alpha_0 - 1$ pseudo-successes and $\beta_0 - 1$ pseudo-failures. Real data add real successes and real failures. The posterior is Beta with shape parameters equal to total successes plus pseudo-successes plus one (and the same for failures).

Proof Sketch

The Bernoulli likelihood is

$$L(p) = \prod_{i=1}^n p^{X_i}(1-p)^{1-X_i} = p^S(1-p)^{n-S}.$$

The Beta prior density is proportional to $p^{\alpha_0-1}(1-p)^{\beta_0-1}$. The posterior is proportional to the product:

$$\pi(p\mid X)\propto p^{\alpha_0+S-1}(1-p)^{\beta_0+n-S-1},$$

which is the kernel of $\operatorname{Beta}(\alpha_0+S,\,\beta_0+n-S)$.

Why It Matters

The posterior mean is $(\alpha_0+S)/(\alpha_0+\beta_0+n)$, a precision-weighted blend of the prior mean $\alpha_0/(\alpha_0+\beta_0)$ and the MLE $S/n$. The blend weight on the MLE is $n/(\alpha_0+\beta_0+n)$, which approaches one as data accumulate. The Jeffreys prior for the Bernoulli is $\operatorname{Beta}(1/2, 1/2)$; the uniform prior is $\operatorname{Beta}(1, 1)$; the Haldane prior is $\operatorname{Beta}(0, 0)$ (improper). All three are defensible starting points with different bias-variance trade-offs.

Failure Mode

Conjugacy is a property of the model, not the data. With Bernoulli data and a Beta prior, the posterior is Beta. Replace the Bernoulli with a probit or any other binary likelihood that is not Bernoulli (different link, different noise) and the Beta is no longer conjugate; the posterior must be computed by integration or sampling.
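A minimal sketch of the update rule from the theorem; the helper name `beta_bernoulli_update`, the seed, and the true $p$ are my choices for illustration:

```python
import numpy as np

def beta_bernoulli_update(alpha0, beta0, successes, n):
    """Posterior shapes: add successes and failures to the prior shapes."""
    return alpha0 + successes, beta0 + n - successes

rng = np.random.default_rng(1)
p_true, n = 0.3, 200
s = int(rng.binomial(n, p_true))

a, b = beta_bernoulli_update(1.0, 1.0, s, n)  # uniform Beta(1, 1) prior
print("posterior:", (a, b))
print("posterior mean:", a / (a + b))  # blend of prior mean 1/2 and the MLE
print("MLE:", s / n)
```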

Connection to the Gamma Distribution

The Beta arises naturally from ratios of independent Gammas. If $X\sim\operatorname{Gamma}(\alpha,1)$ and $Y\sim\operatorname{Gamma}(\beta,1)$ are independent, then

$$\frac{X}{X+Y}\sim\operatorname{Beta}(\alpha,\beta),$$

and this ratio is independent of $X+Y\sim\operatorname{Gamma}(\alpha+\beta,1)$. This identity is the reason the Beta function $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ has the form it does: it is the Jacobian-adjusted ratio of Gamma normalizing constants. See gamma distribution for the Gamma side.

The same identity gives a simple way to sample from the Beta: draw XX and YY from independent Gammas (which are sums of independent Exponentials when shapes are integers) and compute the ratio.
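A sketch of that sampler, checked against NumPy's direct Beta sampler with a two-sample Kolmogorov-Smirnov test; the shapes, seed, and sample size are arbitrary:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
alpha, b, size = 2.5, 4.0, 100_000

x = rng.gamma(alpha, 1.0, size)   # Gamma(alpha, 1)
y = rng.gamma(b, 1.0, size)       # Gamma(beta, 1), independent of x
via_gamma = x / (x + y)           # the ratio identity

direct = rng.beta(alpha, b, size) # NumPy's built-in Beta sampler
print(ks_2samp(via_gamma, direct).pvalue)  # large p-value: same distribution
```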

Maximum Likelihood Estimation

The MLE of $(\alpha,\beta)$ from an i.i.d. Beta sample has no closed form. The log-likelihood is

$$\ell(\alpha,\beta) = -n\log B(\alpha,\beta) + (\alpha-1)\sum\log X_i + (\beta-1)\sum\log(1-X_i),$$

and the score equations are

$$\psi(\alpha) - \psi(\alpha+\beta) = \overline{\log X},\qquad \psi(\beta) - \psi(\alpha+\beta) = \overline{\log(1-X)},$$

where $\psi$ is the digamma function and the overlines denote sample averages. These must be solved numerically (e.g., by Newton's method, with a starting point from the method-of-moments estimator). The Fisher information matrix is

$$I(\alpha,\beta) = \begin{pmatrix} \psi'(\alpha) - \psi'(\alpha+\beta) & -\psi'(\alpha+\beta) \\ -\psi'(\alpha+\beta) & \psi'(\beta) - \psi'(\alpha+\beta) \end{pmatrix}.$$

In Bayesian workflows the MLE is rarely used; the posterior is typically the object of interest, and Beta-Bernoulli conjugacy gives the posterior in closed form for any choice of the Beta prior's shape parameters.
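A sketch of the Newton iteration for the score equations, using `digamma` and `polygamma` from `scipy.special`; the Jacobian assembled below is exactly the Fisher information matrix displayed above. The function name `beta_mle` and the crude default start $(1, 1)$ are my choices; the method-of-moments start from the next section is the safer initialization:

```python
import numpy as np
from scipy.special import digamma, polygamma

def beta_mle(x, a=1.0, b=1.0, tol=1e-10, max_iter=200):
    """Newton's method for the Beta score equations; (a, b) is the start."""
    t1, t2 = np.mean(np.log(x)), np.mean(np.log1p(-x))
    for _ in range(max_iter):
        # Residuals of psi(a) - psi(a+b) = mean(log x), and likewise for b.
        g = np.array([digamma(a) - digamma(a + b) - t1,
                      digamma(b) - digamma(a + b) - t2])
        # Jacobian of g: the Fisher information matrix I(a, b) from above.
        tri = polygamma(1, a + b)
        H = np.array([[polygamma(1, a) - tri, -tri],
                      [-tri, polygamma(1, b) - tri]])
        step = np.linalg.solve(H, g)
        a, b = max(a - step[0], 1e-8), max(b - step[1], 1e-8)  # stay positive
        if np.max(np.abs(step)) < tol:
            break
    return a, b

rng = np.random.default_rng(3)
x = rng.beta(3.0, 2.0, 50_000)
print(beta_mle(x))  # close to (3.0, 2.0)
```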

Method of Moments (Closed Form)

The method-of-moments estimator has a closed form. With $\bar X_n$ the sample mean and $\hat\sigma^2_n$ the sample variance,

$$\hat\alpha_{\text{MoM}} = \bar X_n\!\left(\frac{\bar X_n(1-\bar X_n)}{\hat\sigma^2_n} - 1\right),\qquad \hat\beta_{\text{MoM}} = (1-\bar X_n)\!\left(\frac{\bar X_n(1-\bar X_n)}{\hat\sigma^2_n} - 1\right).$$

The estimator is consistent and almost always a reasonable starting point for the MLE iteration. See method of moments for the general framework.
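A sketch of the closed-form fit (the helper name `beta_mom` is mine); its output can seed the `beta_mle` sketch above:

```python
import numpy as np

def beta_mom(x):
    """Closed-form method-of-moments fit for (alpha, beta)."""
    m, v = x.mean(), x.var()
    c = m * (1 - m) / v - 1   # implied concentration alpha + beta
    return m * c, (1 - m) * c

rng = np.random.default_rng(4)
x = rng.beta(3.0, 2.0, 50_000)
print(beta_mom(x))  # close to (3.0, 2.0); a good Newton starting point
```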

Common Confusions

Watch Out

Pseudo-counts versus real counts

A $\operatorname{Beta}(\alpha_0, \beta_0)$ prior is sometimes described as "equivalent to having seen $\alpha_0 - 1$ successes and $\beta_0 - 1$ failures". This intuition gets the mean right but understates the prior's weight: the prior also fixes a concentration $\alpha_0 + \beta_0$, and since the posterior weight on the MLE is $n/(\alpha_0+\beta_0+n)$, the data cannot dominate until the sample is large relative to $\alpha_0 + \beta_0$. Conjugate priors compress prior information into pseudo-counts, but the pseudo-counts affect the variance as well as the mean.

Watch Out

Beta(1, 1) is the uniform distribution

The Beta(1, 1) density is $f(x) = 1$ for $0 \le x \le 1$, which is the Uniform(0, 1) density. The Uniform is a special Beta, and the Beta family is a two-parameter generalization of the Uniform. Some Bayesian software libraries default to a Beta(1, 1) prior; this is the noninformative uniform-prior choice, not the Jeffreys prior.

Watch Out

The mode is not the mean

For $\alpha, \beta > 1$ the mode is $(\alpha-1)/(\alpha+\beta-2)$ and the mean is $\alpha/(\alpha+\beta)$. They coincide only when $\alpha = \beta$ (symmetric Beta). Bayesian MAP estimates report the mode; posterior-mean estimates report the mean. The two disagree under asymmetric priors and small sample sizes.
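A two-line illustration with an arbitrarily chosen asymmetric Beta(3, 7):

```python
a, b = 3.0, 7.0
mode = (a - 1) / (a + b - 2)  # MAP point estimate: 0.25
mean = a / (a + b)            # posterior-mean point estimate: 0.30
print(mode, mean)             # they disagree because the density is skewed
```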

Exercises

Exercise · Core

Problem

Let $X\sim\operatorname{Beta}(3, 2)$. Compute $\mathbb{E}[X]$, $\operatorname{Var}(X)$, and the mode.

Exercise · Core

Problem

An A/B test starts with a Beta(1, 1) prior on the conversion probability $p$. After serving 200 users, 36 convert. Find the posterior distribution and the posterior mean, and compare to the MLE.

Exercise · Advanced

Problem

Let $U_1,\dots,U_{10}$ be i.i.d. $\operatorname{Unif}(0,1)$. Identify the distribution of the median $U_{(5)}$ (with $n = 10$) and of the third-largest value $U_{(8)}$.

Exercise · Advanced

Problem

Let $X\sim\operatorname{Gamma}(\alpha,1)$ and $Y\sim\operatorname{Gamma}(\beta,1)$ be independent. Show that $X/(X+Y)\sim\operatorname{Beta}(\alpha,\beta)$, independent of $X+Y$.

References

Canonical:

  • Casella and Berger, Statistical Inference (2002), Chapter 3 (Section 3.3 on Beta), Chapter 5 (Section 5.4 on order statistics).
  • Lehmann and Casella, Theory of Point Estimation (1998), Chapter 4 (conjugate priors).
  • Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 1 (conjugate families).

Bayesian framing:

  • Gelman et al., Bayesian Data Analysis (2013), Chapter 2 (Section 2.4 on the Beta-Binomial model).
  • Robert, The Bayesian Choice (2007), Chapter 3.
  • Jaynes, Probability Theory: The Logic of Science (2003), Chapter 6 (uniform and Jeffreys priors for the Bernoulli).

Order statistics:

  • David and Nagaraja, Order Statistics (2003), Chapter 2 (uniform order statistics and the Beta distribution).

Last reviewed: May 11, 2026
