

Beta Distribution

The Beta distribution as the conjugate prior for Bernoulli and Binomial likelihoods, as the order statistic of i.i.d. Uniforms, and as a flexible density on the unit interval: density, moments, conjugacy derivation, and MLE without closed form.



Why This Matters

The Beta distribution is the standard two-parameter family of densities on the unit interval. Two reasons it earns its own page rather than a single line in a survey:

  1. It is the conjugate prior for any likelihood of the form "$k$ successes in $n$ trials". Bernoulli, Binomial, and Negative Binomial all admit a Beta posterior. The update is among the cleanest in Bayesian statistics: add the number of successes to one shape and the number of failures to the other.
  2. The order statistics of an i.i.d. sample from $\operatorname{Unif}(0,1)$ are Beta distributed. The $k$-th smallest value out of $n$ uniforms is $\operatorname{Beta}(k, n-k+1)$. This geometric construction does not depend on the Bayesian interpretation and is the cleanest way to see why the density has its specific shape.

The Beta is also the marginal of any single component of a Dirichlet random vector. The Dirichlet generalizes Beta to the simplex of probability vectors; Beta is the two-category special case.

Definition

Definition

Beta Distribution

A random variable $X$ has a Beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$ if its density is

$$f_X(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\qquad 0 < x < 1,$$

where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the Beta function.

The density is supported on the open unit interval. The two shape parameters control the location and concentration of the mass: large $\alpha$ relative to $\beta$ pulls the mass toward $1$; large $\beta$ relative to $\alpha$ pulls it toward $0$; large $\alpha + \beta$ concentrates the mass.

The case $\alpha = \beta = 1$ is the Uniform(0,1) distribution. The case $\alpha = \beta < 1$ is a U-shaped density with mass concentrated near both endpoints; the case $\alpha = \beta > 1$ is symmetric and unimodal at $1/2$. Asymmetric shape parameters give skewed densities, with the mode at $(\alpha-1)/(\alpha+\beta-2)$ when both shapes exceed one.

The easiest way to read the parameters is to separate direction from concentration:

| Parameters | Mean | Shape | Mechanism |
|---|---|---|---|
| $\operatorname{Beta}(1,1)$ | $0.5$ | flat | no preference inside $(0,1)$ |
| $\operatorname{Beta}(8,2)$ | $0.8$ | mass near $1$ | success count dominates failure count |
| $\operatorname{Beta}(2,8)$ | $0.2$ | mass near $0$ | failure count dominates success count |
| $\operatorname{Beta}(20,20)$ | $0.5$ | tight around $1/2$ | large total count with balanced shapes |
| $\operatorname{Beta}(0.5,0.5)$ | $0.5$ | U-shaped | density diverges at both endpoints |

The ratio $\alpha/(\alpha+\beta)$ sets the center. The sum $\alpha+\beta$ sets how hard the density resists movement away from that center. This is why $\operatorname{Beta}(2,2)$ and $\operatorname{Beta}(200,200)$ have the same mean but behave differently in a Bayesian update: the first moves after a few observations; the second needs a large sample before the posterior mean changes much.

Density and Moments

Proposition

Beta Mean and Variance

Statement

For $X\sim\operatorname{Beta}(\alpha,\beta)$,

$$\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta},\qquad \operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

More generally, $\mathbb{E}[X^k] = B(\alpha+k,\beta)/B(\alpha,\beta)$ for every positive integer $k$.

Intuition

The mean depends only on the ratio of shapes, but for a fixed mean the variance shrinks as $\alpha + \beta$ grows. Two Beta densities with the same mean can have very different variances; the sum $\alpha + \beta$ is the concentration parameter.

Proof Sketch

Direct computation:

$$\mathbb{E}[X^k] = \int_0^1 x^k\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,dx = \frac{B(\alpha+k,\beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+k)\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+k)\Gamma(\alpha)}.$$

For $k = 1$, $\Gamma(\alpha+1)/\Gamma(\alpha) = \alpha$ and $\Gamma(\alpha+\beta)/\Gamma(\alpha+\beta+1) = 1/(\alpha+\beta)$, giving the mean. The variance follows by computing $\mathbb{E}[X^2]$ and subtracting the squared mean.

Why It Matters

The two parameters together control "where the mass is" (through the mean) and "how concentrated it is" (through $\alpha+\beta$). In Bayesian inference, $\alpha + \beta$ acts as a "pseudo-count" of prior observations; the larger it is, the harder it is for new data to move the posterior.

Failure Mode

The moment formulas require $\alpha > 0$ and $\beta > 0$. If either shape is nonpositive, $B(\alpha,\beta)$ is not the normalizing constant of a probability density on $(0,1)$. For $\alpha < 1$ or $\beta < 1$, the density is unbounded at $0$ or $1$, although it remains integrable. Numerical evaluation near the boundaries requires log-density arithmetic; direct evaluation of $x^{\alpha-1}(1-x)^{\beta-1}$ can overflow near a singular endpoint and underflow when both shapes are large.
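The log-density arithmetic mentioned above can be sketched with the standard library alone. The function name `beta_logpdf` is illustrative, not taken from any particular package; the sketch assumes `math.lgamma` is accurate enough for the shapes involved.

```python
import math

def beta_logpdf(x: float, a: float, b: float) -> float:
    """Log-density of Beta(a, b) at x, computed entirely in log space."""
    if not (a > 0.0 and b > 0.0):
        raise ValueError("shape parameters must be positive")
    if not (0.0 < x < 1.0):
        return -math.inf  # outside the open support (0, 1)
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    # log1p(-x) keeps precision when x is very small
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x) - log_norm

# With large shapes the naive kernel x**(a-1) * (1-x)**(b-1) underflows to 0.0,
# while the log-density remains finite.
naive_kernel = 0.5 ** 1999 * 0.5 ** 1999
log_density = beta_logpdf(0.5, 2000.0, 2000.0)
```

Exponentiate only at the end, and only if the final value is known to be representable; sums of log-densities (log-likelihoods) should stay in log space.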

Worked Examples

Example

Same mean, different concentration

Compare $X\sim\operatorname{Beta}(2,2)$ and $Y\sim\operatorname{Beta}(20,20)$. Both have mean $1/2$. The variances differ:

$$\operatorname{Var}(X)=\frac{2\cdot2}{4^2\cdot5}=0.05,\qquad \operatorname{Var}(Y)=\frac{20\cdot20}{40^2\cdot41}\approx0.0061.$$

The mechanism is the denominator $\alpha+\beta+1$. Multiplying both shapes by ten leaves the mean unchanged but increases the concentration. In a Beta-Bernoulli model, $\operatorname{Beta}(2,2)$ behaves like a weak prior centered at $1/2$; $\operatorname{Beta}(20,20)$ behaves like a prior with much more resistance. They encode the same location, not the same information.
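The two variances can be checked numerically, both from the closed-form expression and from the raw-moment identity $\mathbb{E}[X^k] = B(\alpha+k,\beta)/B(\alpha,\beta)$. This is a minimal sketch; the helper names are illustrative.

```python
import math

def beta_mean_var(a: float, b: float) -> tuple[float, float]:
    """Closed-form mean and variance of Beta(a, b)."""
    total = a + b
    return a / total, a * b / (total ** 2 * (total + 1.0))

def beta_raw_moment(a: float, b: float, k: int) -> float:
    """E[X^k] = B(a + k, b) / B(a, b), evaluated through log-gamma."""
    log_ratio = (math.lgamma(a + k) + math.lgamma(a + b)
                 - math.lgamma(a + b + k) - math.lgamma(a))
    return math.exp(log_ratio)

mean_x, var_x = beta_mean_var(2.0, 2.0)      # weakly concentrated
mean_y, var_y = beta_mean_var(20.0, 20.0)    # ten times the concentration

# Variance recovered from the first two raw moments, as in the proof sketch.
var_x_from_moments = beta_raw_moment(2.0, 2.0, 2) - beta_raw_moment(2.0, 2.0, 1) ** 2
```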

Example

Posterior mean as a weighted average

Start with $p\sim\operatorname{Beta}(8,2)$, whose prior mean is $0.8$. Observe $S=6$ successes in $n=20$ Bernoulli trials. The posterior is

$$p\mid X\sim\operatorname{Beta}(8+6,\ 2+14)=\operatorname{Beta}(14,16).$$

The posterior mean is $14/30\approx0.467$.

The same number appears from the weighted-average form:

$$\frac{\alpha_0+\beta_0}{\alpha_0+\beta_0+n}\cdot \frac{\alpha_0}{\alpha_0+\beta_0} + \frac{n}{\alpha_0+\beta_0+n}\cdot\frac{S}{n} = \frac{10}{30}\cdot0.8+\frac{20}{30}\cdot0.3 =0.467.$$

The posterior does not average the prior density with the empirical frequency. It averages two probability estimates with weights given by prior concentration and sample size.
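A short sketch of the update and the weighted-average identity for the numbers in this example. The function names are illustrative; the second function assumes at least one trial so that the empirical frequency is defined.

```python
def beta_bernoulli_update(a0: float, b0: float, successes: int, trials: int):
    """Posterior shapes after observing `successes` in `trials` Bernoulli trials."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return a0 + successes, b0 + (trials - successes)

def posterior_mean_weighted(a0: float, b0: float, successes: int, trials: int) -> float:
    """Posterior mean written as a blend of prior mean and empirical frequency."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    prior_weight = (a0 + b0) / (a0 + b0 + trials)
    prior_mean = a0 / (a0 + b0)
    data_mean = successes / trials
    return prior_weight * prior_mean + (1.0 - prior_weight) * data_mean

a_post, b_post = beta_bernoulli_update(8.0, 2.0, successes=6, trials=20)
direct_mean = a_post / (a_post + b_post)                    # 14 / 30
blended_mean = posterior_mean_weighted(8.0, 2.0, 6, 20)     # same value
```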

Example

Uniform order statistic by counting gaps

Let $U_1,\dots,U_5$ be i.i.d. uniforms and consider the second-smallest value $U_{(2)}$. The theorem gives $U_{(2)}\sim\operatorname{Beta}(2,4)$, so the mean is $2/6=1/3$.

The density can be read without memorizing the formula. For $U_{(2)}$ to land near $x$, one observation must fall below $x$, one must land near $x$, and three must fall above $x$. The number of assignments is $5!/(1!\,1!\,3!)=20$, so

$$f_{U_{(2)}}(x)=20x(1-x)^3,\qquad0<x<1.$$

This is exactly the $\operatorname{Beta}(2,4)$ density. The exponent of $x$ counts observations forced below the statistic; the exponent of $1-x$ counts observations forced above it.
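A small simulation can check this example under an assumed seed and sample size. It compares the empirical mean of the second-smallest of five uniforms with $1/3$, and the empirical fraction below $1/2$ with the exact value $1 - (1/2)^5 - 5(1/2)^5 = 26/32$ (at least two of the five uniforms fall below $1/2$).

```python
import random

def kth_smallest_uniform(n: int, k: int, rng: random.Random) -> float:
    """Draw n i.i.d. Unif(0,1) values and return the k-th smallest (1-indexed)."""
    draws = sorted(rng.random() for _ in range(n))
    return draws[k - 1]

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [kth_smallest_uniform(5, 2, rng) for _ in range(20000)]

empirical_mean = sum(samples) / len(samples)
empirical_cdf_half = sum(s <= 0.5 for s in samples) / len(samples)

exact_mean = 2.0 / 6.0        # mean of Beta(2, 4)
exact_cdf_half = 26.0 / 32.0  # P(at least 2 of 5 uniforms are <= 1/2)
```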

Beta as a Uniform Order Statistic

Theorem

Order Statistics of Uniforms Are Beta

Statement

Let $U_1,\dots,U_n$ be i.i.d. $\operatorname{Unif}(0,1)$ and let $U_{(k)}$ denote the $k$-th smallest value. Then

$$U_{(k)}\sim\operatorname{Beta}(k,\,n-k+1).$$

Intuition

For $U_{(k)}$ to lie in a small interval near $x$, we need $k-1$ uniforms to fall below it, the order statistic itself to fall near $x$, and $n-k$ uniforms to fall above it. The density at $x$ is $\binom{n}{k-1,\,1,\,n-k}x^{k-1}(1-x)^{n-k}$ times the density of a single uniform near $x$, which simplifies to the Beta density with $\alpha = k$ and $\beta = n - k + 1$.

Proof Sketch

The joint density of $U_{(1)},\dots,U_{(n)}$ is $n!\cdot\mathbf{1}\{0 \le u_1 \le\cdots\le u_n\le 1\}$. To compute the marginal density of $U_{(k)}$, fix $u_k = x$ and integrate over the other coordinates. The lower $k-1$ uniforms lie in $[0, x]$ (volume $x^{k-1}/(k-1)!$), and the upper $n-k$ lie in $[x, 1]$ (volume $(1-x)^{n-k}/(n-k)!$). Combining:

$$f_{U_{(k)}}(x) = n!\cdot\frac{x^{k-1}}{(k-1)!}\cdot\frac{(1-x)^{n-k}}{(n-k)!} = \frac{n!}{(k-1)!\,(n-k)!}\,x^{k-1}(1-x)^{n-k}.$$

The constant $n!/[(k-1)!\,(n-k)!] = 1/B(k, n-k+1)$ by the relationship between the Beta function and binomial coefficients. So $U_{(k)}\sim\operatorname{Beta}(k,n-k+1)$.

Why It Matters

This gives a purely geometric origin for the Beta family that does not depend on Bayesian thinking. The Beta is the natural distribution of "the position of the $k$-th of $n$ uniformly scattered points on the unit interval". The conjugate-prior role for the Bernoulli falls out from the same combinatorial structure: under a uniform prior, the posterior of $p$ given $k$ successes in $n$ trials is $\operatorname{Beta}(k+1,\,n-k+1)$, which is the distribution of the $(k+1)$-th smallest of $n+1$ i.i.d. uniforms.

Failure Mode

The result requires independent uniforms with the same distribution. If the sample comes from a non-uniform continuous distribution with CDF $F$, then $F(X_{(k)})$ is Beta, but $X_{(k)}$ itself in general is not. If the observations are dependent, the multinomial counting argument fails because the counts below and above $x$ are no longer binomial. Ties also change the statement: for discrete samples there may be no unique $k$-th location with a density on $(0,1)$.

Conjugate Prior for the Bernoulli and Binomial

Theorem

Beta-Bernoulli Conjugacy

Statement

Let $X_1,\dots,X_n\sim\operatorname{Bern}(p)$ be i.i.d., let $S = \sum X_i$ be the total successes, and let the prior be $p\sim\operatorname{Beta}(\alpha_0,\beta_0)$. Then

$$p\mid X_1,\dots,X_n\sim\operatorname{Beta}(\alpha_0 + S,\ \beta_0 + n - S).$$

For an observation of $S\sim\operatorname{Bin}(n,p)$ as a single sufficient summary, the same posterior holds.

Intuition

A Beta prior contributes $\alpha_0 - 1$ pseudo-successes and $\beta_0 - 1$ pseudo-failures. Real data adds real successes and real failures. The posterior is Beta with shape parameters equal to total successes plus pseudo-successes plus one (and the same for failures).

Proof Sketch

The Bernoulli likelihood is

$$L(p) = \prod_{i=1}^n p^{X_i}(1-p)^{1-X_i} = p^S(1-p)^{n-S}.$$

The Beta prior density is proportional to $p^{\alpha_0-1}(1-p)^{\beta_0-1}$. The posterior is proportional to the product:

$$\pi(p\mid X)\propto p^{\alpha_0+S-1}(1-p)^{\beta_0+n-S-1},$$

which is the kernel of $\operatorname{Beta}(\alpha_0+S, \beta_0+n-S)$.

Why It Matters

The posterior mean is $(\alpha_0+S)/(\alpha_0+\beta_0+n)$, a weighted blend of the prior mean $\alpha_0/(\alpha_0+\beta_0)$ and the MLE $S/n$. The blend weight on the MLE is $n/(\alpha_0+\beta_0+n)$, which approaches one as data accumulates. The Jeffreys prior for the Bernoulli is $\operatorname{Beta}(1/2, 1/2)$; the uniform prior is $\operatorname{Beta}(1, 1)$; the Haldane prior is $\operatorname{Beta}(0, 0)$ (improper). All three are common starting points with different bias-variance trade-offs.

Failure Mode

Conjugacy is a property of the likelihood-prior pair. With i.i.d. Bernoulli or Binomial sampling and a proper Beta prior, the posterior is Beta. If the trials are not conditionally independent given $p$, if each observation has its own probability $p_i$, or if the likelihood is a regression model such as logistic or probit regression, the Beta update no longer applies. The same failure occurs when the prior is improper without a proper posterior: for example, the Haldane prior with all successes or all failures leaves one posterior shape at zero.

The conjugacy calculation is short because the Bernoulli likelihood and the Beta density use the same sufficient statistic. The data enter only through $S$ and $n-S$. This is not an accident: the Bernoulli family is a one-parameter exponential family, and the Beta prior is the conjugate prior obtained by matching the powers of $p$ and $1-p$. Once the likelihood has extra structure, such as covariates in logistic regression, the sufficient statistic is no longer just a pair of counts. The posterior cannot stay inside the two-shape Beta family because two numbers are not enough to store the information in the design matrix.

Example

Posterior predictive probability

Conjugacy also gives the next-trial predictive probability without numerical integration. If $p\mid X\sim\operatorname{Beta}(\alpha_0+S,\beta_0+n-S)$, then

$$P(X_{n+1}=1\mid X_1,\dots,X_n) =\mathbb{E}[p\mid X] =\frac{\alpha_0+S}{\alpha_0+\beta_0+n}.$$

For the earlier $S=6$, $n=20$, $\operatorname{Beta}(8,2)$ example, the predictive probability of success on the next trial is $14/30\approx0.467$. The plug-in MLE would use $6/20=0.3$; the prior-only prediction would use $8/10=0.8$. The posterior predictive value is between them because it averages over posterior uncertainty in $p$ instead of pretending that either the prior mean or the empirical frequency is exact.

For $m$ future Bernoulli trials, the predictive distribution of the future success count is Beta-Binomial, not Binomial with $p$ fixed at the posterior mean. The distinction is variance: integrating over $p$ adds uncertainty. If a model predicts the right average number of future successes but systematically understates the spread, the missing term is often this posterior uncertainty.
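The variance gap can be made concrete with the Beta-Binomial pmf $\binom{m}{y}B(a+y,\,b+m-y)/B(a,b)$. The sketch below uses the $\operatorname{Beta}(14,16)$ posterior from the running example and an assumed horizon of $m=10$ future trials; the helper names are illustrative.

```python
import math

def log_beta(a: float, b: float) -> float:
    """log B(a, b) through log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(y: int, m: int, a: float, b: float) -> float:
    """P(Y = y) for Y successes in m future trials when p ~ Beta(a, b)."""
    log_choose = math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
    return math.exp(log_choose + log_beta(a + y, b + m - y) - log_beta(a, b))

a, b, m = 14.0, 16.0, 10  # posterior shapes from the example; m is an assumed horizon
pmf = [beta_binomial_pmf(y, m, a, b) for y in range(m + 1)]

pred_mean = sum(y * p for y, p in enumerate(pmf))
pred_var = sum((y - pred_mean) ** 2 * p for y, p in enumerate(pmf))

# Plug-in alternative: Binomial(m, p_hat) with p fixed at the posterior mean.
p_hat = a / (a + b)
plugin_var = m * p_hat * (1.0 - p_hat)
```

Both predictions share the mean $m\,a/(a+b)$, but the Beta-Binomial variance is larger than the plug-in Binomial variance because it carries the posterior uncertainty in $p$.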

Connection to the Gamma Distribution

The Beta arises naturally from ratios of independent Gammas. If $X\sim\operatorname{Gamma}(\alpha,1)$ and $Y\sim\operatorname{Gamma}(\beta,1)$ are independent, then

$$\frac{X}{X+Y}\sim\operatorname{Beta}(\alpha,\beta),$$

and this ratio is independent of $X+Y\sim\operatorname{Gamma}(\alpha+\beta,1)$. This identity is the reason the Beta function $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ has the form it does: it is the Jacobian-adjusted ratio of Gamma normalizing constants. See gamma distribution for the Gamma side.

The same identity gives a simple way to sample from the Beta: draw $X$ and $Y$ from independent Gammas (which are sums of independent Exponentials when shapes are integers) and compute the ratio.
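The sampling recipe can be sketched with the standard library's `random.gammavariate(alpha, beta)`, whose first argument is the shape and second the scale. The seed, sample size, and the choice of $\operatorname{Beta}(8,2)$ are assumptions for illustration.

```python
import random

def sample_beta_via_gammas(a: float, b: float, rng: random.Random) -> float:
    """One Beta(a, b) draw as X / (X + Y) with independent unit-scale Gammas."""
    x = rng.gammavariate(a, 1.0)  # shape a, scale 1
    y = rng.gammavariate(b, 1.0)  # shape b, scale 1
    return x / (x + y)

rng = random.Random(1)  # fixed seed so the run is reproducible
draws = [sample_beta_via_gammas(8.0, 2.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)  # should sit near 8 / (8 + 2) = 0.8
```

In practice `random.betavariate` is available directly; the explicit ratio is shown only to mirror the identity.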

Maximum Likelihood Estimation

The MLE of $(\alpha,\beta)$ from an i.i.d. Beta sample has no closed form. The log-likelihood is

$$\ell(\alpha,\beta) = -n\log B(\alpha,\beta) + (\alpha-1)\sum\log X_i + (\beta-1)\sum\log(1-X_i),$$

and the score equations are

$$\psi(\alpha) - \psi(\alpha+\beta) = \overline{\log X},\qquad \psi(\beta) - \psi(\alpha+\beta) = \overline{\log(1-X)},$$

where $\psi$ is the digamma function and the overlines denote sample averages. These must be solved numerically (e.g., by Newton's method, with a starting point from the method-of-moments estimator). The Fisher information matrix for a single observation is

$$I(\alpha,\beta) = \begin{pmatrix} \psi'(\alpha) - \psi'(\alpha+\beta) & -\psi'(\alpha+\beta) \\ -\psi'(\alpha+\beta) & \psi'(\beta) - \psi'(\alpha+\beta) \end{pmatrix}.$$

In Bayesian workflows the MLE is rarely used; the posterior is typically the object of interest, and the Beta-Bernoulli conjugacy gives the posterior in closed form for any proper Beta prior.
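For readers who do need the MLE, here is a minimal Newton sketch of the score equations using only the standard library. The `digamma` and `trigamma` helpers are hand-rolled approximations (upward recurrence followed by an asymptotic series), not library functions, and the step-halving guard, tolerance, and simulated sample are assumptions of this sketch.

```python
import math
import random

def digamma(x: float) -> float:
    """Approximate psi(x) for x > 0: recurrence up to x >= 6, then asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x              # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return acc + math.log(x) - 0.5 / x - series

def trigamma(x: float) -> float:
    """Approximate psi'(x) for x > 0 by the same strategy."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)        # psi'(x) = psi'(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return acc + inv + 0.5 * inv2 + series

def fit_beta_mle(data, tol=1e-10, max_iter=100):
    """Newton iteration on the score equations, started at method of moments."""
    n = len(data)
    mean_log_x = sum(math.log(x) for x in data) / n
    mean_log_1mx = sum(math.log1p(-x) for x in data) / n
    m = sum(data) / n
    v = sum((x - m) ** 2 for x in data) / n          # 1/n variance
    tau = m * (1.0 - m) / v - 1.0
    if tau <= 0.0:
        raise ValueError("method-of-moments start is invalid for this sample")
    a, b = m * tau, (1.0 - m) * tau
    for _ in range(max_iter):
        # Residuals of the two score equations.
        g1 = digamma(a) - digamma(a + b) - mean_log_x
        g2 = digamma(b) - digamma(a + b) - mean_log_1mx
        # Jacobian equals the per-observation Fisher information matrix.
        t = trigamma(a + b)
        j11, j12, j22 = trigamma(a) - t, -t, trigamma(b) - t
        det = j11 * j22 - j12 * j12
        da = (j22 * g1 - j12 * g2) / det
        db = (j11 * g2 - j12 * g1) / det
        step = 1.0
        while a - step * da <= 0.0 or b - step * db <= 0.0:
            step *= 0.5                               # keep both shapes positive
        a, b = a - step * da, b - step * db
        if abs(da) + abs(db) < tol:
            break
    return a, b

rng = random.Random(2)
sample = [rng.betavariate(3.0, 2.0) for _ in range(5000)]
alpha_hat, beta_hat = fit_beta_mle(sample)
```

Because the Jacobian of the score equations is the Fisher information matrix, the same matrix evaluated at the fitted shapes and scaled by the sample size gives the usual asymptotic standard errors.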

Method of Moments (Closed Form)

The method-of-moments estimator has a closed form. With $\bar X_n$ the sample mean and $\hat\sigma^2_n$ the sample variance,

$$\hat\alpha_{\text{MoM}} = \bar X_n\left(\frac{\bar X_n(1-\bar X_n)}{\hat\sigma^2_n} - 1\right),\qquad \hat\beta_{\text{MoM}} = (1-\bar X_n)\left(\frac{\bar X_n(1-\bar X_n)}{\hat\sigma^2_n} - 1\right).$$

The estimator is consistent and almost always reasonable as a starting point for MLE iteration. See method of moments for the general framework.

The method-of-moments formula also names a boundary condition. A Beta distribution must satisfy $\operatorname{Var}(X) < \mathbb{E}[X](1-\mathbb{E}[X])$. If the empirical variance violates that inequality, the plug-in expression $\bar X_n(1-\bar X_n)/\hat\sigma_n^2 - 1$ is nonpositive, so both shape estimates are nonpositive and therefore invalid. This can happen with data rounded to endpoints, with a misspecified sample that is not supported on $[0,1]$, or with a mixture that is too dispersed for a single Beta component. The failure is diagnostic: the problem is not Newton iteration; the one-component Beta model is the wrong summary for the sample.
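A minimal sketch of the estimator with the boundary check made explicit. The function names are illustrative, and the sketch assumes the $1/n$ form of the sample variance.

```python
def shapes_from_moments(mean: float, variance: float) -> tuple[float, float]:
    """Solve the Beta mean and variance relations for (alpha, beta)."""
    if not (0.0 < mean < 1.0) or variance <= 0.0:
        raise ValueError("need 0 < mean < 1 and variance > 0")
    concentration = mean * (1.0 - mean) / variance - 1.0
    if concentration <= 0.0:
        # Variance at or above mean * (1 - mean): no proper Beta matches it.
        raise ValueError("variance too large for a single Beta distribution")
    return mean * concentration, (1.0 - mean) * concentration

def beta_method_of_moments(data) -> tuple[float, float]:
    """Method-of-moments shapes from a sample, using the 1/n variance."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    return shapes_from_moments(mean, variance)

# Mean 0.5 and 1/n variance 0.05 give concentration 4, so alpha = beta = 2.
alpha_hat, beta_hat = beta_method_of_moments([0.2, 0.4, 0.6, 0.8])
```

A sample made only of zeros and ones, such as `[0.0, 1.0]`, hits the boundary exactly and raises instead of returning shapes.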

Diagnostics and Boundary Cases

The Beta family is flexible on $[0,1]$, but it is still a two-parameter model. The first diagnostic is the mean-variance relation. If the mean is $m$ and the concentration is $\tau=\alpha+\beta$, then

$$\alpha=m\tau,\qquad \beta=(1-m)\tau,\qquad \operatorname{Var}(X)=\frac{m(1-m)}{\tau+1}.$$

For a fixed mean, the largest possible Beta variance is approached as $\tau\downarrow0$ and equals $m(1-m)$ in the limit. Any sample variance above that level cannot come from a single proper Beta distribution with that sample mean. The usual method-of-moments formula is just this relation solved backward:

$$\tau=\frac{m(1-m)}{v}-1.$$

A negative estimate of $\tau$ is not a numerical glitch; it says the observed spread is too large for the family.

Example

When a single Beta fit goes degenerate

Suppose a dataset on $[0,1]$ has sample mean $\bar x=0.5$ and sample variance $\hat v=0.24$. The method-of-moments concentration is

$$\tau=\frac{0.5(1-0.5)}{0.24}-1=\frac{0.25}{0.24}-1\approx0.0417,$$

so $\alpha=\beta=\tau\bar x\approx0.021$. That is a deep U-shaped Beta with almost all of its mass pressed against $0$ and $1$. A fit this close to the ceiling $m(1-m)=0.25$ is a warning, not a model: the data is nearly as dispersed as a two-point mass, and a single smooth Beta density describes it poorly.

The ceiling is sharp, and it bounds the data itself, not just the Beta family. A reported sample variance above $0.25$ at mean $0.5$ would force $\tau<0$, i.e. negative shape parameters, which no Beta admits. For values genuinely confined to $[0,1]$ that case cannot arise: Popoviciu's inequality caps the variance at $(b-a)^2/4$ and the Bhatia-Davis inequality caps it at $(b-m)(m-a)$, both equal to $0.25$ here. These ceilings bound the population ($1/n$) variance; the unbiased ($n-1$) estimator can exceed $0.25$ on data genuinely inside $[0,1]$, so read $\hat v$ as the $1/n$ form throughout. Seeing $\tau<0$ under that convention therefore means the values are not actually in $[0,1]$ or $\hat v$ was mis-estimated. When the sample piles up at exact zeros and ones, use a zero-one-inflated model, a mixture model, or a data-generating model for the counts that produced the proportions.
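The difference between the two variance conventions is easy to see on a tiny sample of exact endpoints. In the standard library, `statistics.pvariance` is the $1/n$ form and `statistics.variance` is the $n-1$ form; the four-point dataset is an assumed illustration.

```python
import statistics

data = [0.0, 0.0, 1.0, 1.0]  # values inside [0, 1], all at the endpoints

mean = statistics.mean(data)
ceiling = mean * (1.0 - mean)                   # Bhatia-Davis bound on [0, 1]

var_population = statistics.pvariance(data)     # 1/n form: sits on the ceiling
var_unbiased = statistics.variance(data)        # n-1 form: exceeds the ceiling

tau_population = ceiling / var_population - 1.0  # 0: boundary of the Beta family
tau_unbiased = ceiling / var_unbiased - 1.0      # negative under the n-1 convention
```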

Endpoint observations need separate handling. The Beta density is defined on the open interval $(0,1)$, so $\log X_i$ and $\log(1-X_i)$ in the likelihood are finite only when every observation lies strictly between zero and one. Exact zeros and ones are common when the data are empirical rates: a small school may have zero failures in a year; a small experiment may have all successes. Those observations are not illegal data, but they do not fit the continuous Beta likelihood without a measurement model. The usual fixes are explicit: jittering is a preprocessing convention, not a likelihood; adding counts is a Binomial model; adding point masses at zero and one is a zero-one-inflated Beta model.

Example

Exact endpoints versus small neighborhoods

Take $p\sim\operatorname{Beta}(1/2,1/2)$. The density is

$$f(p)=\frac{1}{\pi\sqrt{p(1-p)}}.$$

It diverges as $p\downarrow0$ and as $p\uparrow1$, but $P(p=0)=P(p=1)=0$. A simulation will produce many values very close to the endpoints and no exact endpoint unless the random-number generator rounds.

That distinction changes modeling decisions. If observed records contain literal zeros because a rate was computed as $0/7$, the underlying Bernoulli probability might still be inside $(0,1)$ and the count model is appropriate. If observed records contain structural zeros, such as a category that cannot occur for a subgroup, the Beta model is wrong because the atom at zero is part of the phenomenon.

The second diagnostic is whether the data-generating story is exchangeable. Beta-Bernoulli conjugacy assumes one latent $p$ for all trials. If the first ten observations come from one population and the next ten from another, a single posterior shape pair hides heterogeneity by averaging it away. In that case a hierarchical model is the natural extension: each group has its own $p_j$, and the group probabilities share a higher-level distribution. The ordinary Beta update is still useful locally, but it is no longer the whole model.

Choosing a Beta Prior

The practical prior-choice question is not "which Beta is neutral?" but "what mean and concentration are defensible before seeing these data?" The mean $m=\alpha/(\alpha+\beta)$ gives the center. The concentration $\tau=\alpha+\beta$ gives the prior sample-size scale. Once those are chosen,

$$\alpha=m\tau,\qquad \beta=(1-m)\tau.$$

For example, suppose a baseline conversion rate is near $0.04$, but the new experiment is different enough that the prior should be weak. Taking $m=0.04$ and $\tau=25$ gives $\operatorname{Beta}(1,24)$. The prior mean is $0.04$, but after $n=500$ visitors it receives only $25/(25+500)$ of the posterior-mean weight. If the old experiments are closely matched to the new one, $\tau=500$ would encode a much stronger prior, giving $\operatorname{Beta}(20,480)$.
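The mean-and-concentration parameterization translates directly into code. The sketch below reproduces the two priors in this example and the share of posterior-mean weight each keeps after 500 observations; the function names are illustrative.

```python
def beta_from_mean_concentration(mean: float, concentration: float) -> tuple[float, float]:
    """Shapes (alpha, beta) with the given prior mean and alpha + beta."""
    if not (0.0 < mean < 1.0) or concentration <= 0.0:
        raise ValueError("need 0 < mean < 1 and concentration > 0")
    return mean * concentration, (1.0 - mean) * concentration

def prior_weight(concentration: float, n: int) -> float:
    """Share of the posterior mean contributed by the prior mean after n trials."""
    return concentration / (concentration + n)

weak_prior = beta_from_mean_concentration(0.04, 25.0)     # about Beta(1, 24)
strong_prior = beta_from_mean_concentration(0.04, 500.0)  # about Beta(20, 480)

weak_share = prior_weight(25.0, 500)     # 25 / 525, under 5 percent
strong_share = prior_weight(500.0, 500)  # one half
```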

This parameterization also exposes bad defaults. A uniform $\operatorname{Beta}(1,1)$ prior has mean $0.5$. For rare-event problems, that center is far from plausible, though the concentration is weak. Jeffreys' $\operatorname{Beta}(1/2,1/2)$ prior is invariant under smooth reparameterization and puts more density near the endpoints. It is often a better default for pure Bernoulli inference, but it is not a substitute for domain knowledge when base rates are known.

Prior sensitivity should be reported on the scale readers care about. Do not only list $\alpha$ and $\beta$. Report the prior mean, concentration, posterior mean after the observed counts, and predictive probability for the next trial. If two plausible priors lead to the same decision after the data, the inference is insensitive to that prior choice. If they do not, the report should say that the data are not yet strong enough to dominate the prior.

Common Confusions

Watch Out

Pseudo-counts versus real counts

A $\operatorname{Beta}(\alpha_0, \beta_0)$ prior is sometimes described as "equivalent to having seen $\alpha_0 - 1$ successes and $\beta_0 - 1$ failures". That count is measured against a uniform $\operatorname{Beta}(1,1)$ baseline and reproduces the prior mode (when both shapes exceed one), not the prior mean: the mean $\alpha_0/(\alpha_0+\beta_0)$ and the posterior-mean weights use $\alpha_0$ and $\beta_0$ themselves. The prior also has a concentration $\alpha_0 + \beta_0$ that real data cannot outweigh unless the sample is at least that large. Conjugate priors compress prior information into a sufficient statistic, but they do not replace it with imaginary observations.

Watch Out

Beta(1, 1) is the uniform distribution

The Beta(1, 1) density is $f(x) = 1$ for $0 < x < 1$, which is the Uniform(0, 1) density. The Uniform is a special Beta, and the Beta family is a two-parameter generalization of the Uniform: any shape pair other than $(1,1)$ gives a non-uniform density. Some Bayesian software libraries default to a Beta(1, 1) prior; this is the noninformative-uniform-prior choice, not the Jeffreys prior.

Watch Out

The mode is not the mean

For $\alpha, \beta > 1$ the mode is $(\alpha-1)/(\alpha+\beta-2)$ and the mean is $\alpha/(\alpha+\beta)$. They coincide only when $\alpha = \beta$ (symmetric Beta). Bayesian MAP estimates report the mode; posterior-mean estimates report the mean. The two disagree under asymmetric priors and small sample sizes.
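A quick numerical check of the gap, using the $\operatorname{Beta}(14,16)$ posterior from the worked example and an assumed symmetric $\operatorname{Beta}(5,5)$ for contrast. The function names are illustrative.

```python
def beta_mean(a: float, b: float) -> float:
    """Mean of Beta(a, b)."""
    return a / (a + b)

def beta_mode(a: float, b: float) -> float:
    """Interior mode of Beta(a, b); defined here only when both shapes exceed one."""
    if a <= 1.0 or b <= 1.0:
        raise ValueError("interior mode requires a > 1 and b > 1")
    return (a - 1.0) / (a + b - 2.0)

posterior_mean = beta_mean(14.0, 16.0)   # 14 / 30
posterior_mode = beta_mode(14.0, 16.0)   # 13 / 28, the MAP estimate
symmetric_gap = beta_mean(5.0, 5.0) - beta_mode(5.0, 5.0)  # zero when shapes match
```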

Watch Out

A density spike is not endpoint mass

When $\alpha < 1$, the Beta density diverges at $0$; when $\beta < 1$, it diverges at $1$. That does not mean $P(X=0)>0$ or $P(X=1)>0$. A proper Beta random variable is continuous on $(0,1)$ and assigns probability zero to each single point. The spike means small neighborhoods near the endpoint receive large probability relative to their width, not that the distribution has an atom at the endpoint. This distinction matters in Bayesian updates: a $\operatorname{Beta}(1/2,1/2)$ prior can put heavy density near $0$ and $1$ without declaring either value certain.

Exercises

ExerciseCore

Problem

Let $X\sim\operatorname{Beta}(3, 2)$. Compute $\mathbb{E}[X]$, $\operatorname{Var}(X)$, and the mode.

ExerciseCore

Problem

An A/B test starts with a Beta(1, 1) prior on the conversion probability $p$. After serving 200 users, 36 convert. Find the posterior distribution and the posterior mean, and compare to the MLE.

ExerciseAdvanced

Problem

Let $U_1,\dots,U_{10}$ be i.i.d. $\operatorname{Unif}(0,1)$. Identify the distribution of the lower median $U_{(5)}$ (with $n = 10$) and the third-largest value $U_{(8)}$.

ExerciseAdvanced

Problem

Let $X\sim\operatorname{Gamma}(\alpha,1)$ and $Y\sim\operatorname{Gamma}(\beta,1)$ be independent. Show that $X/(X+Y)\sim\operatorname{Beta}(\alpha,\beta)$, independent of $X+Y$.

References

Canonical:

  • Casella and Berger, Statistical Inference (2002), Chapter 3 (Section 3.3 on Beta), Chapter 5 (Section 5.4 on order statistics).
  • Lehmann and Casella, Theory of Point Estimation (1998), Chapter 4 (conjugate priors).
  • Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 1 (conjugate families).

Bayesian framing:

  • Gelman et al., Bayesian Data Analysis (2013), Chapter 2 (Section 2.4 on the Beta-Binomial model).
  • Robert, The Bayesian Choice (2007), Chapter 3.
  • Jaynes, Probability Theory: The Logic of Science (2003), Chapter 6 (uniform and Jeffreys priors for the Bernoulli).

Order statistics:

  • David and Nagaraja, Order Statistics (2003), Chapter 2 (uniform order statistics and the Beta distribution).

Last reviewed: May 11, 2026
