Beta Distribution

Sneiderman, Robby

The Beta distribution as the conjugate prior for Bernoulli and Binomial likelihoods, as the order statistic of i.i.d. Uniforms, and as a flexible density on the unit interval: density, moments, conjugacy derivation, and MLE without closed form.

Prerequisites

Common Probability Distributions, Distributions Atlas, Gamma Distribution

Why This Matters

The Beta distribution is the standard two-parameter family of densities on the unit interval. Two reasons it earns its own page rather than a single line in a survey:

It is the conjugate prior for any likelihood of the form " $k$ successes in $n$ trials". Bernoulli, Binomial, and Negative Binomial all admit a Beta posterior. The update is among the cleanest in Bayesian statistics: add the number of successes to one shape and the number of failures to the other.
The order statistics of an i.i.d. sample from $\operatorname{Unif}(0,1)$ are Beta distributed. The $k$ -th smallest value out of $n$ uniforms is $\operatorname{Beta}(k, n-k+1)$ . This geometric construction does not depend on the Bayesian interpretation and is the cleanest way to see why the density has its specific shape.

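The Beta is also the marginal distribution of each single component of a Dirichlet random vector: if $(X_1,\dots,X_K)\sim\operatorname{Dirichlet}(\alpha_1,\dots,\alpha_K)$ , then $X_i\sim\operatorname{Beta}(\alpha_i,\ \sum_{j\ne i}\alpha_j)$ . The Dirichlet generalizes Beta to the simplex of probability vectors; Beta is the two-category special case.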

Definition

Beta Distribution $X \sim \operatorname{Beta}(\alpha, \beta)$

A random variable $X$ has a Beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$ if its density is

$f_X(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\qquad 0 < x < 1,$

where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the Beta function.

The density is supported on the open unit interval. The two shape parameters control the location and concentration of the mass: large $\alpha$ relative to $\beta$ pulls the mass toward $1$ ; large $\beta$ relative to $\alpha$ pulls it toward $0$ ; large $\alpha + \beta$ concentrates the mass.

The case $\alpha = \beta = 1$ is the Uniform(0,1) distribution. The case $\alpha = \beta < 1$ is a U-shaped density with mass concentrated at both endpoints; the case $\alpha = \beta > 1$ is symmetric and unimodal at $1/2$ . Asymmetric shape parameters give skewed densities, with the mode at $(\alpha-1)/(\alpha+\beta-2)$ when both shapes exceed one.
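The shape claims above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `beta_pdf` and `beta_mode` are illustrative and not taken from any particular package.

```python
import math

def beta_pdf(x: float, a: float, b: float) -> float:
    """Beta(a, b) density at x, for 0 < x < 1 and a, b > 0."""
    if not (0.0 < x < 1.0):
        return 0.0
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)
    beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn

def beta_mode(a: float, b: float) -> float:
    """Interior mode (a - 1) / (a + b - 2); only valid when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("interior mode formula needs a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

print(beta_pdf(0.3, 1, 1))   # Beta(1, 1) is Uniform(0, 1): density 1 everywhere
print(beta_pdf(0.5, 2, 2))   # symmetric case, peak at 1/2
print(beta_mode(5, 2))       # a > b pulls the mode toward 1

# U-shape when a = b < 1: the density is higher near 0 than at the center.
print(beta_pdf(0.01, 0.5, 0.5) > beta_pdf(0.5, 0.5, 0.5))
```

Calling `math.gamma` directly is fine for moderate shapes but overflows for large ones, which is why a log-scale version is preferable in practice.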

Density and Moments

Proposition

Beta Mean and Variance

Statement

For $X\sim\operatorname{Beta}(\alpha,\beta)$ , $\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta},\qquad \operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$ More generally, $\mathbb{E}[X^k] = B(\alpha+k,\beta)/B(\alpha,\beta)$ for every positive integer $k$ .

Intuition

The mean depends only on the ratio of shapes, but the variance shrinks as $\alpha + \beta$ grows. Two Beta densities with the same mean can have very different variances; the sum $\alpha + \beta$ is the concentration parameter.

Proof Sketch

Direct computation: $\mathbb{E}[X^k] = \int_0^1 x^k\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,dx = \frac{B(\alpha+k,\beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+k)\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+k)\Gamma(\alpha)}.$ For $k = 1$ , $\Gamma(\alpha+1)/\Gamma(\alpha) = \alpha$ and $\Gamma(\alpha+\beta)/\Gamma(\alpha+\beta+1) = 1/(\alpha+\beta)$ , giving the mean. The variance follows by computing $\mathbb{E}[X^2]$ and subtracting the squared mean.
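The variance step can be written out explicitly. Using the moment formula with $k = 2$ and $\Gamma(z+1) = z\Gamma(z)$ ,

$\mathbb{E}[X^2] = \frac{\Gamma(\alpha+2)\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+2)\Gamma(\alpha)} = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)},$

so that

$\operatorname{Var}(X) = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \frac{\alpha^2}{(\alpha+\beta)^2} = \frac{\alpha(\alpha+1)(\alpha+\beta) - \alpha^2(\alpha+\beta+1)}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$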

Why It Matters

The two parameters together control "where the mass is" (through the mean) and "how concentrated it is" (through $\alpha+\beta$ ). In Bayesian inference, $\alpha + \beta$ acts as a "pseudo-count" of prior observations; the larger it is, the harder it is for new data to move the posterior.

Failure Mode

For $\alpha < 1$ the density is unbounded at $0$ , and for $\beta < 1$ it is unbounded at $1$ , although it remains integrable. Numerical evaluation of the density near the boundaries requires care; work with the log-density to avoid overflow and underflow.
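One way to follow the log-density advice is sketched below with the standard-library `math.lgamma` and `math.log1p`. The function name `beta_logpdf` is illustrative; this is a sketch under those assumptions, not a reference implementation.

```python
import math

def beta_logpdf(x: float, a: float, b: float) -> float:
    """Log of the Beta(a, b) density at x in (0, 1), computed on the log scale."""
    if not (0.0 < x < 1.0):
        return -math.inf
    # log B(a, b) via lgamma avoids overflow in Gamma(a) for large shapes.
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    # log1p(-x) is more accurate than log(1 - x) when x is close to 0.
    return (a - 1) * math.log(x) + (b - 1) * math.log1p(-x) - log_beta_fn

# Shapes this large overflow math.gamma, but the log-density stays finite.
print(beta_logpdf(0.5, 500.0, 500.0))

# Near the boundary with a < 1 the density is huge, the log-density moderate.
print(beta_logpdf(1e-12, 0.5, 0.5))
```

Exponentiate only at the end, and only if the density itself is needed; sums of log-densities (log-likelihoods) can stay on the log scale throughout.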

Beta as a Uniform Order Statistic

Theorem

Order Statistics of Uniforms Are Beta

Statement

Let $U_1,\dots,U_n$ be i.i.d. $\operatorname{Unif}(0,1)$ and let $U_{(k)}$ denote the $k$ -th smallest value. Then $U_{(k)}\sim\operatorname{Beta}(k,\,n-k+1).$

Intuition

For $U_{(k)}$ to lie in a small interval near $x$ , we need $k-1$ uniforms to fall below it, the order statistic itself to fall near $x$ , and $n-k$ uniforms to fall above it. The density at $x$ is $\binom{n}{k-1,1,n-k}x^{k-1}(1-x)^{n-k}$ times the density of a single uniform near $x$ , which simplifies to the Beta density with $\alpha = k$ and $\beta = n - k + 1$ .

Proof Sketch

The joint density of $U_{(1)},\dots,U_{(n)}$ is $n!\cdot\mathbf{1}\{0 \le u_1 \le\cdots\le u_n\le 1\}$ . To compute the marginal density of $U_{(k)}$ , fix $u_{(k)} = x$ and integrate over the other coordinates. The lower $k-1$ uniforms lie in $[0, x]$ (volume $x^{k-1}/(k-1)!$ ), and the upper $n-k$ lie in $[x, 1]$ (volume $(1-x)^{n-k}/(n-k)!$ ). Combining: $f_{U_{(k)}}(x) = n!\cdot\frac{x^{k-1}}{(k-1)!}\cdot\frac{(1-x)^{n-k}}{(n-k)!} = \frac{n!}{(k-1)!(n-k)!}x^{k-1}(1-x)^{n-k}.$ The constant $n!/[(k-1)!(n-k)!] = 1/B(k, n-k+1)$ by the relationship between the Beta function and binomial coefficients. So $U_{(k)}\sim\operatorname{Beta}(k,n-k+1)$ .
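The theorem is easy to sanity-check by simulation. The sketch below draws the $k$ -th smallest of $n$ uniforms many times and compares the empirical mean and variance with the $\operatorname{Beta}(k, n-k+1)$ values from the mean and variance proposition. The seed, sample sizes, and helper name `kth_smallest_uniform` are arbitrary choices for illustration.

```python
import random

def kth_smallest_uniform(n: int, k: int, rng: random.Random) -> float:
    """Draw n i.i.d. Uniform(0, 1) values and return the k-th smallest (1-indexed)."""
    draws = sorted(rng.random() for _ in range(n))
    return draws[k - 1]

rng = random.Random(0)  # fixed seed so the run is repeatable
n, k, reps = 10, 3, 20_000
samples = [kth_smallest_uniform(n, k, rng) for _ in range(reps)]

# Theoretical mean and variance of Beta(a, b) with a = k, b = n - k + 1.
a, b = k, n - k + 1
mean_theory = a / (a + b)
var_theory = a * b / ((a + b) ** 2 * (a + b + 1))

mean_sim = sum(samples) / reps
var_sim = sum((s - mean_sim) ** 2 for s in samples) / reps

print(f"mean: simulated {mean_sim:.4f}, Beta theory {mean_theory:.4f}")
print(f"var:  simulated {var_sim:.4f}, Beta theory {var_theory:.4f}")
```

Matching two moments does not prove the distributions agree, but a mismatch here would immediately flag an indexing mistake such as using $\operatorname{Beta}(k, n-k)$ .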

Why It Matters

This gives a purely geometric origin for the Beta family that does not depend on Bayesian thinking. The Beta is the natural distribution of "the position of the $k$ -th of $n$ uniformly scattered points on the unit interval". The conjugate-prior role for the Bernoulli falls out from the same combinatorial structure: under a uniform prior, the posterior of $p$ given $k$ successes in $n$ trials is $\operatorname{Beta}(k+1,\,n-k+1)$ , which is the distribution of the $(k+1)$ -th smallest of $n+1$ i.i.d. uniforms.

Failure Mode

The result requires i.i.d. uniforms on $[0, 1]$ . Order statistics of non-uniform samples are not Beta, although for a continuous distribution with CDF $F$ the transformed values $F(X_{(k)})$ are Beta by the probability integral transform. Order statistics of dependent samples are not Beta in general.

Conjugate Prior for the Bernoulli and Binomial

Theorem

Beta-Bernoulli Conjugacy

Statement

Let $X_1,\dots,X_n\sim\operatorname{Bern}(p)$ be i.i.d., let $S = \sum X_i$ be the total successes, and let the prior be $p\sim\operatorname{Beta}(\alpha_0,\beta_0)$ . Then $p\mid X_1,\dots,X_n\sim\operatorname{Beta}(\alpha_0 + S,\ \beta_0 + n - S).$ For an observation of $S\sim\operatorname{Bin}(n,p)$ as a single sufficient summary, the same posterior holds.

Intuition

A Beta prior contributes $\alpha_0 - 1$ pseudo-successes and $\beta_0 - 1$ pseudo-failures. Real data adds real successes and real failures. The posterior is Beta with shape parameters equal to total successes plus pseudo-successes plus one (and the same for failures).

Proof Sketch

The Bernoulli likelihood is $L(p) = \prod_{i=1}^n p^{X_i}(1-p)^{1-X_i} = p^S(1-p)^{n-S}.$ The Beta prior density is proportional to $p^{\alpha_0-1}(1-p)^{\beta_0-1}$ . The posterior is proportional to the product: $\pi(p\mid X)\propto p^{\alpha_0+S-1}(1-p)^{\beta_0+n-S-1},$ which is the kernel of $\operatorname{Beta}(\alpha_0+S, \beta_0+n-S)$ .

Why It Matters

The posterior mean is $(\alpha_0+S)/(\alpha_0+\beta_0+n)$ , a precision-weighted blend of the prior mean $\alpha_0/(\alpha_0+\beta_0)$ and the MLE $S/n$ . The blend weight on the MLE is $n/(\alpha_0+\beta_0+n)$ , which approaches one as data accumulates. The Jeffreys prior for the Bernoulli is $\operatorname{Beta}(1/2, 1/2)$ ; the uniform prior is $\operatorname{Beta}(1, 1)$ ; the Haldane prior is $\operatorname{Beta}(0, 0)$ (improper). All three are common default choices that differ in how strongly they pull the posterior mean toward $1/2$ ; the Haldane prior yields a proper posterior only when $0 < S < n$ .
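The update and the blend formula can be sketched in a few lines. The prior and the data below are hypothetical numbers chosen for illustration, and `beta_bernoulli_update` is an illustrative name.

```python
def beta_bernoulli_update(alpha0: float, beta0: float, successes: int, trials: int):
    """Return posterior shape parameters after observing `successes` in `trials`."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return alpha0 + successes, beta0 + (trials - successes)

# Hypothetical numbers: Beta(2, 2) prior, 7 successes in 10 trials.
alpha0, beta0, s, n = 2.0, 2.0, 7, 10
alpha_n, beta_n = beta_bernoulli_update(alpha0, beta0, s, n)

posterior_mean = alpha_n / (alpha_n + beta_n)

# The same number, written as a weighted blend of prior mean and MLE.
w = n / (alpha0 + beta0 + n)             # weight on the MLE
prior_mean = alpha0 / (alpha0 + beta0)
mle = s / n
blend = (1 - w) * prior_mean + w * mle

print(alpha_n, beta_n)         # posterior shape parameters
print(posterior_mean, blend)   # the two agree up to floating-point rounding
```

Because the posterior is again a Beta, the function can be applied repeatedly: feeding the output shapes back in as the next prior gives the same result as a single update on the pooled data.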

Failure Mode

Conjugacy is a property of the model, not the data. With Bernoulli data and a Beta prior on a constant success probability $p$ , the posterior is Beta. Replace this with a probit or logistic regression, where the success probability depends on covariates through a link function, and the Beta-Bernoulli update no longer applies; the posterior over the model parameters must be computed by numerical integration or sampling.

Connection to the Gamma Distribution

The Beta arises naturally from ratios of independent Gammas. If $X\sim\operatorname{Gamma}(\alpha,1)$ and $Y\sim\operatorname{Gamma}(\beta,1)$ are independent, then

$\frac{X}{X+Y}\sim\operatorname{Beta}(\alpha,\beta),$

and this ratio is independent of $X+Y\sim\operatorname{Gamma}(\alpha+\beta,1)$ . This identity is the reason the Beta function $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ has the form it does: it is the Jacobian-adjusted ratio of Gamma normalizing constants. See gamma distribution for the Gamma side.

The same identity gives a simple way to sample from the Beta: draw $X\sim\operatorname{Gamma}(\alpha,1)$ and $Y\sim\operatorname{Gamma}(\beta,1)$ independently (each is a sum of independent Exponentials when its shape is an integer) and return $X/(X+Y)$ .
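A minimal sketch of this sampler using the standard-library `random.gammavariate` (whose second argument is the scale) follows. The helper name `sample_beta_via_gammas`, the seed, and the parameter values are illustrative assumptions.

```python
import random

def sample_beta_via_gammas(a: float, b: float, rng: random.Random) -> float:
    """One Beta(a, b) draw as X / (X + Y) with X ~ Gamma(a, 1), Y ~ Gamma(b, 1)."""
    x = rng.gammavariate(a, 1.0)  # shape a, scale 1
    y = rng.gammavariate(b, 1.0)  # shape b, scale 1
    return x / (x + y)

rng = random.Random(1)  # fixed seed so the run is repeatable
a, b, reps = 2.0, 5.0, 20_000
draws = [sample_beta_via_gammas(a, b, rng) for _ in range(reps)]

mean_sim = sum(draws) / reps
print(f"simulated mean {mean_sim:.4f}, theory {a / (a + b):.4f}")
```

In everyday code the built-in `random.betavariate(alpha, beta)` does the same job; the explicit version is shown only to make the Gamma construction visible.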

Maximum Likelihood Estimation

The MLE of $(\alpha,\beta)$ from an i.i.d. Beta sample has no closed form. The log-likelihood is

$\ell(\alpha,\beta) = -n\log B(\alpha,\beta) + (\alpha-1)\sum\log X_i + (\beta-1)\sum\log(1-X_i),$

and the score equations are

$\psi(\alpha) - \psi(\alpha+\beta) = \overline{\log X},\qquad \psi(\beta) - \psi(\alpha+\beta) = \overline{\log(1-X)},$

where $\psi$ is the digamma function and the overlines denote sample averages. These must be solved numerically (e.g., by Newton's method, with a starting point from the method-of-moments estimator). The Fisher information matrix for a single observation is

$I(\alpha,\beta) = \begin{pmatrix} \psi'(\alpha) - \psi'(\alpha+\beta) & -\psi'(\alpha+\beta) \\ -\psi'(\alpha+\beta) & \psi'(\beta) - \psi'(\alpha+\beta) \end{pmatrix}.$
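The Newton iteration on the score equations can be sketched as follows, using the Fisher information above as the negative Hessian of the per-observation log-likelihood. To stay within the standard library, the `digamma` and `trigamma` helpers here are simple recurrence-plus-asymptotic-series approximations written for this example; a numerical library routine (for instance `scipy.special.digamma` and `scipy.special.polygamma`) would normally be used instead. The starting point, step-halving safeguard, tolerance, and simulated data are all illustrative assumptions.

```python
import math
import random

def digamma(x: float) -> float:
    """psi(x) for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x            # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def trigamma(x: float) -> float:
    """psi'(x) for x > 0, same strategy as digamma."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)      # psi'(x) = psi'(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1/6 - inv2 * (1/30 - inv2 / 42))

def beta_mle(data, tol=1e-10, max_iter=100):
    """Newton's method on the score equations, started at the method of moments."""
    n = len(data)
    mean_log_x = sum(math.log(x) for x in data) / n
    mean_log_1mx = sum(math.log1p(-x) for x in data) / n

    # Method-of-moments starting point (1/n-normalized variance).
    m = sum(data) / n
    v = sum((x - m) ** 2 for x in data) / n
    common = m * (1 - m) / v - 1
    a, b = m * common, (1 - m) * common

    for _ in range(max_iter):
        # Per-observation score (gradient of the average log-likelihood).
        g_a = mean_log_x - (digamma(a) - digamma(a + b))
        g_b = mean_log_1mx - (digamma(b) - digamma(a + b))
        # Per-observation Fisher information entries.
        t_ab = trigamma(a + b)
        i_aa, i_bb, i_ab = trigamma(a) - t_ab, trigamma(b) - t_ab, -t_ab
        det = i_aa * i_bb - i_ab * i_ab
        # Newton step: solve I * step = score for the 2x2 system.
        step_a = (i_bb * g_a - i_ab * g_b) / det
        step_b = (i_aa * g_b - i_ab * g_a) / det
        # Halve the step if it would leave the region a > 0, b > 0.
        scale = 1.0
        while a + scale * step_a <= 0 or b + scale * step_b <= 0:
            scale *= 0.5
        a, b = a + scale * step_a, b + scale * step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b

# Simulated data from Beta(2, 5) via the Gamma-ratio construction.
rng = random.Random(2)
true_a, true_b = 2.0, 5.0
data = []
for _ in range(5000):
    x, y = rng.gammavariate(true_a, 1.0), rng.gammavariate(true_b, 1.0)
    data.append(x / (x + y))

a_hat, b_hat = beta_mle(data)
print(f"MLE: alpha = {a_hat:.3f}, beta = {b_hat:.3f}")
```

The sketch assumes every observation lies strictly inside $(0, 1)$ ; values exactly equal to $0$ or $1$ make the log terms infinite and need separate handling.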

In Bayesian workflows the MLE of $(\alpha,\beta)$ is rarely needed; the posterior is typically the object of interest, and the Beta-Bernoulli conjugacy gives the posterior in closed form whatever values of the prior shape parameters were chosen.

Method of Moments (Closed Form)

The method-of-moments estimator has a closed form. With $\bar X_n$ the sample mean and $\hat\sigma^2_n$ the sample variance,

$\hat\alpha_{\text{MoM}} = \bar X_n\!\left(\frac{\bar X_n(1-\bar X_n)}{\hat\sigma^2_n} - 1\right),\qquad \hat\beta_{\text{MoM}} = (1-\bar X_n)\!\left(\frac{\bar X_n(1-\bar X_n)}{\hat\sigma^2_n} - 1\right).$

The estimator is consistent and usually a reasonable starting point for MLE iteration, provided $\hat\sigma^2_n < \bar X_n(1-\bar X_n)$ so that both estimates are positive. See method of moments for the general framework.
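The closed form translates directly into code. The sketch below uses the standard-library `statistics` module; the function name is illustrative, and the choice of the $1/n$ -normalized variance is an assumption made here because it keeps both estimates positive for any non-constant data in $(0, 1)$ .

```python
import statistics

def beta_method_of_moments(data):
    """Closed-form method-of-moments estimates (alpha_hat, beta_hat)."""
    m = statistics.fmean(data)
    # Assumption: 1/n-normalized variance, which keeps the bracket below
    # positive for non-constant data in (0, 1).
    v = statistics.pvariance(data)
    if v <= 0:
        raise ValueError("data must not be constant")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# Mean 0.5 and variance 0.05 give common = 0.25 / 0.05 - 1 = 4.
print(beta_method_of_moments([0.2, 0.4, 0.6, 0.8]))  # close to (2.0, 2.0)
```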

Common Confusions

Watch Out

Pseudo-counts versus real counts

A $\operatorname{Beta}(\alpha_0, \beta_0)$ prior is sometimes described as "equivalent to having seen $\alpha_0 - 1$ successes and $\beta_0 - 1$ failures". That count is relative to a uniform $\operatorname{Beta}(1, 1)$ baseline; relative to the improper Haldane $\operatorname{Beta}(0, 0)$ baseline the pseudo-counts are $\alpha_0$ and $\beta_0$ . Either way, the prior carries a "concentration" $\alpha_0 + \beta_0$ that real data cannot outweigh unless the sample is at least that large. Conjugate priors compress prior information into a sufficient statistic, but they do not replace it with imaginary observations.

Watch Out

Beta(1, 1) is the uniform distribution

The Beta(1, 1) density is $f(x) = 1$ for $0 \le x \le 1$ , which is the Uniform(0, 1) density. The Uniform is a special case of the Beta, and the Beta family is a two-parameter generalization of the Uniform. Some Bayesian software libraries default to a Beta(1, 1) prior; this is the uniform (flat) prior, not the Jeffreys prior $\operatorname{Beta}(1/2, 1/2)$ .

Watch Out

The mode is not the mean

For $\alpha, \beta > 1$ the mode is $(\alpha-1)/(\alpha+\beta-2)$ and the mean is $\alpha/(\alpha+\beta)$ . They coincide only when $\alpha = \beta$ (symmetric Beta). Bayesian MAP estimates report the mode; posterior-mean estimates report the mean. The two differ whenever the posterior is asymmetric, and the gap is most visible at small sample sizes.
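The size of the gap follows from subtracting the two formulas:

$\frac{\alpha}{\alpha+\beta} - \frac{\alpha-1}{\alpha+\beta-2} = \frac{\beta-\alpha}{(\alpha+\beta)(\alpha+\beta-2)}.$

As an illustrative pair of posteriors with the same mean, $\operatorname{Beta}(3, 9)$ has mean $0.25$ and mode $2/10 = 0.2$ , a gap of $0.05$ , while $\operatorname{Beta}(30, 90)$ has mean $0.25$ and mode $29/118 \approx 0.246$ , a gap of about $0.004$ . The gap shrinks roughly like $1/(\alpha+\beta)$ as the concentration grows.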

Exercises

Exercise (Core)

Problem

Let $X\sim\operatorname{Beta}(3, 2)$ . Compute $\mathbb{E}[X]$ , $\operatorname{Var}(X)$ , and the mode.

Exercise (Core)

Problem

An A/B test starts with a Beta(1, 1) prior on the conversion probability $p$ . After serving 200 users, 36 convert. Find the posterior distribution and the posterior mean, and compare to the MLE.

Exercise (Advanced)

Problem

Let $U_1,\dots,U_{10}$ be i.i.d. $\operatorname{Unif}(0,1)$ , so $n = 10$ . Identify the distribution of the lower median $U_{(5)}$ and of the third-largest value $U_{(8)}$ .

Exercise (Advanced)

Problem

Let $X\sim\operatorname{Gamma}(\alpha,1)$ and $Y\sim\operatorname{Gamma}(\beta,1)$ be independent. Show that $X/(X+Y)\sim\operatorname{Beta}(\alpha,\beta)$ , independent of $X+Y$ .

References

Canonical:

Casella and Berger, Statistical Inference (2002), Chapter 3 (Section 3.3 on Beta), Chapter 5 (Section 5.4 on order statistics).
Lehmann and Casella, Theory of Point Estimation (1998), Chapter 4 (conjugate priors).
Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 1 (conjugate families).

Bayesian framing:

Gelman et al., Bayesian Data Analysis (2013), Chapter 2 (Section 2.4 on the Beta-Binomial model).
Robert, The Bayesian Choice (2007), Chapter 3.
Jaynes, Probability Theory: The Logic of Science (2003), Chapter 6 (uniform and Jeffreys priors for the Bernoulli).

Order statistics:

David and Nagaraja, Order Statistics (2003), Chapter 2 (uniform order statistics and the Beta distribution).

Last reviewed: May 11, 2026
