Beta Distribution
The Beta distribution as the conjugate prior for Bernoulli and Binomial likelihoods, as the order statistic of i.i.d. Uniforms, and as a flexible density on the unit interval: density, moments, conjugacy derivation, and MLE without closed form.
Why This Matters
The Beta distribution is the standard two-parameter family of densities on the unit interval. Two reasons it earns its own page rather than a single line in a survey:
- It is the conjugate prior for any likelihood of the form "$s$ successes in $n$ trials". Bernoulli, Binomial, and Negative Binomial all admit a Beta posterior. The update is among the cleanest in Bayesian statistics: add the number of successes to one shape and the number of failures to the other.
- The order statistics of an i.i.d. sample from $\mathrm{Uniform}(0,1)$ are Beta distributed. The $k$-th smallest value out of $n$ uniforms is $\mathrm{Beta}(k, n-k+1)$. This geometric construction predates the Bayesian interpretation by several decades and is the cleanest way to see why the density has its specific shape.
The Beta is also the marginal of each single component of a Dirichlet random vector. The Dirichlet generalizes Beta to the simplex of probability vectors; Beta is the two-category special case.
Definition
Beta Distribution
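A random variable $X$ has a Beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$ if its density is
$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \qquad 0 < x < 1,$$
where $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the Beta function.
The density is supported on the open unit interval. The two shape parameters control the location and concentration of the mass: large $\alpha$ relative to $\beta$ pulls the mass toward $1$; large $\beta$ relative to $\alpha$ pulls it toward $0$; large $\alpha+\beta$ concentrates the mass.
The case $\alpha = \beta = 1$ is the Uniform(0,1) distribution. The case $\alpha, \beta < 1$ is a U-shaped density with mass concentrated at both endpoints; the case $\alpha = \beta > 1$ is symmetric and unimodal at $1/2$. Asymmetric shape parameters give skewed densities, with the mode at $(\alpha-1)/(\alpha+\beta-2)$ when both shapes exceed one.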
The easiest way to read the parameters is to separate direction from concentration:
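| Parameters | Mean | Shape | Mechanism |
|---|---|---|---|
| $\alpha = \beta = 1$ | $1/2$ | flat | no preference inside $(0,1)$ |
| $\alpha \gg \beta$ | near $1$ | mass near $1$ | success count dominates failure count |
| $\alpha \ll \beta$ | near $0$ | mass near $0$ | failure count dominates success count |
| $\alpha = \beta \gg 1$ | $1/2$ | tight around $1/2$ | large total count with balanced shapes |
| $\alpha, \beta < 1$ | $\alpha/(\alpha+\beta)$ | U-shaped | both endpoints receive singular density |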
The ratio $\alpha/(\alpha+\beta)$ sets the center. The sum $\alpha+\beta$ sets how hard the density resists movement away from that center. This is why $\mathrm{Beta}(\alpha, \beta)$ and $\mathrm{Beta}(c\alpha, c\beta)$ with $c \gg 1$ have the same mean but behave differently in a Bayesian update: the first moves after a few observations; the second needs a large sample before the posterior mean changes much.
Density and Moments
Beta Mean and Variance
Statement
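For $X \sim \mathrm{Beta}(\alpha, \beta)$,
$$E[X] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$
More generally,
$$E[X^k] = \prod_{j=0}^{k-1} \frac{\alpha+j}{\alpha+\beta+j}$$
for every positive integer $k$.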
Intuition
The mean depends only on the ratio of shapes, but the variance shrinks as $\alpha+\beta$ grows. Two Beta densities with the same mean can have very different variances; the sum $\alpha+\beta$ is the concentration parameter.
Proof Sketch
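Direct computation:
$$E[X^k] = \frac{1}{B(\alpha,\beta)} \int_0^1 x^{\alpha+k-1}(1-x)^{\beta-1}\,dx = \frac{B(\alpha+k, \beta)}{B(\alpha, \beta)}.$$
For $k = 1$, $B(\alpha+1, \beta) = \Gamma(\alpha+1)\Gamma(\beta)/\Gamma(\alpha+\beta+1)$ and $\Gamma(z+1) = z\,\Gamma(z)$, giving the mean $\alpha/(\alpha+\beta)$. The variance follows by computing $E[X^2] = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}$ and subtracting the squared mean.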
Why It Matters
The two parameters together control "where the mass is" (through the mean) and "how concentrated it is" (through $\alpha+\beta$). In Bayesian inference, $\alpha+\beta$ acts as a "pseudo-count" of prior observations; the larger it is, the harder it is for new data to move the posterior.
Failure Mode
The moment formulas require $\alpha > 0$ and $\beta > 0$. If either shape is nonpositive, $B(\alpha, \beta)$ is not the normalizing constant of a probability density on $(0,1)$. For $\alpha < 1$ or $\beta < 1$, the density is unbounded at $0$ or $1$, although it remains integrable. Numerical evaluation near the boundaries requires log-density arithmetic; direct evaluation of $x^{\alpha-1}(1-x)^{\beta-1}$ can overflow near a singular endpoint and underflow when both shapes are large.
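The log-density arithmetic mentioned above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation; the function name `beta_logpdf` is a hypothetical choice, and the convention of returning negative infinity outside the open interval is an assumption made here.

```python
import math

def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x, computed entirely in log space."""
    if not (a > 0 and b > 0):
        raise ValueError("shape parameters must be positive")
    if not (0.0 < x < 1.0):
        return float("-inf")  # assumed convention: zero density off the open support
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    # log1p(-x) is more accurate than log(1 - x) when x is tiny
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x) - log_norm
```

Working with `lgamma` avoids forming the Gamma values themselves, which would overflow a double for large shapes.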
Worked Examples
Same mean, different concentration
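Compare $\mathrm{Beta}(\alpha, \beta)$ and $\mathrm{Beta}(10\alpha, 10\beta)$. Both have mean $\mu = \alpha/(\alpha+\beta)$. The variances differ:
$$\frac{\mu(1-\mu)}{\alpha+\beta+1} \quad \text{versus} \quad \frac{\mu(1-\mu)}{10(\alpha+\beta)+1}.$$
The mechanism is the denominator $\alpha+\beta+1$. Multiplying both shapes by ten leaves the mean unchanged but increases the concentration. In a Beta-Bernoulli model, $\mathrm{Beta}(\alpha, \beta)$ with small $\alpha+\beta$ behaves like a weak prior centered at $\mu$; $\mathrm{Beta}(10\alpha, 10\beta)$ behaves like a prior with much more resistance. They encode the same location, not the same information.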
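The same-mean, different-concentration comparison can be checked numerically. The sketch below uses hypothetical shapes $(2, 3)$ and $(20, 30)$, chosen only for illustration, and the helper names are assumptions rather than anything defined on this page.

```python
def beta_mean(a, b):
    """E[X] = a / (a + b)."""
    return a / (a + b)

def beta_var(a, b):
    """Var(X) = a b / ((a + b)^2 (a + b + 1))."""
    total = a + b
    return a * b / (total ** 2 * (total + 1.0))

def beta_raw_moment(a, b, k):
    """E[X^k] = prod_{j=0}^{k-1} (a + j) / (a + b + j)."""
    moment = 1.0
    for j in range(k):
        moment *= (a + j) / (a + b + j)
    return moment

# Hypothetical shapes: same mean, ten times the concentration.
for a, b in ((2.0, 3.0), (20.0, 30.0)):
    print(a, b, beta_mean(a, b), beta_var(a, b))
```

The second moment from `beta_raw_moment` minus the squared mean reproduces `beta_var`, which is the computation described in the proof sketch.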
Posterior mean as a weighted average
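Start with a $\mathrm{Beta}(\alpha, \beta)$ prior, whose prior mean is $\alpha/(\alpha+\beta)$. Observe $s$ successes in $n$ Bernoulli trials. The posterior is
$$\theta \mid \text{data} \sim \mathrm{Beta}(\alpha+s,\ \beta+n-s).$$
The posterior mean is $(\alpha+s)/(\alpha+\beta+n)$.
The same number appears from the weighted-average form:
$$\frac{\alpha+s}{\alpha+\beta+n} = \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n}\cdot\frac{s}{n}.$$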
The posterior does not average the prior density with the empirical frequency. It averages two probability estimates with weights given by prior concentration and sample size.
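A short sketch of the update and of the weighted-average identity follows. The prior $\mathrm{Beta}(2, 2)$ and the data (7 successes in 10 trials) are hypothetical values picked for illustration; the function names are likewise assumptions.

```python
def beta_bernoulli_update(alpha, beta, successes, trials):
    """Posterior shapes after observing `successes` in `trials` Bernoulli trials."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return alpha + successes, beta + (trials - successes)

def posterior_mean_as_blend(alpha, beta, successes, trials):
    """Posterior mean written as weight * MLE + (1 - weight) * prior mean."""
    if trials <= 0:
        raise ValueError("the blend form needs at least one trial")
    weight = trials / (alpha + beta + trials)  # weight on the empirical frequency
    prior_mean = alpha / (alpha + beta)
    mle = successes / trials
    return weight * mle + (1.0 - weight) * prior_mean

# Hypothetical numbers: Beta(2, 2) prior, 7 successes in 10 trials.
a_post, b_post = beta_bernoulli_update(2.0, 2.0, 7, 10)
direct = a_post / (a_post + b_post)
blend = posterior_mean_as_blend(2.0, 2.0, 7, 10)
```

Both routes give the same posterior mean; the blend form just makes the weights on the prior mean and on the empirical frequency explicit.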
Uniform order statistic by counting gaps
Let $U_1, \dots, U_5$ be i.i.d. $\mathrm{Uniform}(0,1)$ and consider the second-smallest value $U_{(2)}$. The theorem gives $U_{(2)} \sim \mathrm{Beta}(2, 4)$, so the mean is $2/6 = 1/3$.
The density can be read without memorizing the formula. For $U_{(2)}$ to land near $u$, one observation must fall below $u$, one must land near $u$, and three must fall above $u$. The number of assignments is $5!/(1!\,1!\,3!) = 20$, so
$$f_{U_{(2)}}(u) = 20\,u\,(1-u)^3, \qquad 0 < u < 1.$$
This is exactly the $\mathrm{Beta}(2, 4)$ density. The exponent of $u$ counts observations forced below the statistic; the exponent of $1-u$ counts observations forced above it.
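The counting argument can be checked by simulation. The sketch below draws five uniforms repeatedly, keeps the second smallest, and compares the Monte Carlo mean with $k/(n+1)$; it also checks that the counting constant matches $1/B(k, n-k+1)$. The seed, the number of draws, and the function names are arbitrary choices made here.

```python
import math
import random

def order_stat_constant(n, k):
    """n! / ((k-1)! (n-k)!), the counting constant in the density of U_(k)."""
    return math.factorial(n) // (math.factorial(k - 1) * math.factorial(n - k))

def inv_beta_function(a, b):
    """1 / B(a, b) computed from Gamma functions."""
    return math.gamma(a + b) / (math.gamma(a) * math.gamma(b))

def simulate_order_stat_mean(n, k, draws, seed=0):
    """Monte Carlo mean of the k-th smallest of n i.i.d. Uniform(0, 1) values."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        sample = sorted(rng.random() for _ in range(n))
        total += sample[k - 1]
    return total / draws

n, k = 5, 2
const = order_stat_constant(n, k)
sim_mean = simulate_order_stat_mean(n, k, draws=50_000)
theory_mean = k / (n + 1)  # mean of Beta(k, n - k + 1)
```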
Beta as a Uniform Order Statistic
Order Statistics of Uniforms Are Beta
Statement
Let $U_1, \dots, U_n$ be i.i.d. $\mathrm{Uniform}(0,1)$ and let $U_{(k)}$ denote the $k$-th smallest value. Then
$$U_{(k)} \sim \mathrm{Beta}(k,\ n-k+1).$$
Intuition
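For $U_{(k)}$ to lie in a small interval near $u$, we need $k-1$ uniforms to fall below it, the order statistic itself to fall near $u$, and $n-k$ uniforms to fall above it. The density at $u$ is $\frac{n!}{(k-1)!\,(n-k)!}\,u^{k-1}(1-u)^{n-k}$ times the density of a single uniform near $u$ (which equals $1$), which simplifies to the Beta density with $\alpha = k$ and $\beta = n-k+1$.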
Proof Sketch
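The joint density of $(U_{(1)}, \dots, U_{(n)})$ is $n!\,\mathbf{1}\{0 < u_1 < \cdots < u_n < 1\}$. To compute the marginal density of $U_{(k)}$, fix $U_{(k)} = u$ and integrate over the other coordinates. The lower $k-1$ ordered uniforms lie in $(0, u)$ (volume $u^{k-1}/(k-1)!$), and the upper $n-k$ lie in $(u, 1)$ (volume $(1-u)^{n-k}/(n-k)!$). Combining:
$$f_{U_{(k)}}(u) = \frac{n!}{(k-1)!\,(n-k)!}\,u^{k-1}(1-u)^{n-k}.$$
The constant $\frac{n!}{(k-1)!\,(n-k)!} = \frac{1}{B(k, n-k+1)}$ by the relationship between the Beta function and binomial coefficients. So $U_{(k)} \sim \mathrm{Beta}(k, n-k+1)$.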
Why It Matters
This gives a purely geometric origin for the Beta family that does not depend on Bayesian thinking. The Beta is the natural distribution of "the position of the $k$-th of $n$ uniformly scattered points on the unit interval". The conjugate-prior role for the Bernoulli falls out from the same combinatorial structure: under a uniform prior, the posterior of $\theta$ given $s$ successes in $n$ trials is $\mathrm{Beta}(s+1, n-s+1)$, which is the law of the $(s+1)$-th smallest of $n+1$ uniforms.
Failure Mode
The result requires independent uniforms with the same distribution. If the sample $X_1, \dots, X_n$ comes from a non-uniform continuous distribution with CDF $F$, then $F(X_{(k)})$ is Beta, but $X_{(k)}$ itself is not. If the observations are dependent, the multinomial counting argument fails because the counts below and above $u$ are no longer binomial. Ties also change the statement: for discrete samples there may be no unique $k$-th location with a density on $(0,1)$.
Conjugate Prior for the Bernoulli and Binomial
Beta-Bernoulli Conjugacy
Statement
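Let $X_1, \dots, X_n \mid \theta$ be i.i.d. $\mathrm{Bernoulli}(\theta)$, let $s = \sum_{i=1}^n X_i$ be the total successes, and let the prior be $\theta \sim \mathrm{Beta}(\alpha, \beta)$. Then
$$\theta \mid X_1, \dots, X_n \sim \mathrm{Beta}(\alpha+s,\ \beta+n-s).$$
For an observation of $s \sim \mathrm{Binomial}(n, \theta)$ as a single sufficient summary, the same posterior holds.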
Intuition
A $\mathrm{Beta}(\alpha, \beta)$ prior contributes $\alpha-1$ pseudo-successes and $\beta-1$ pseudo-failures. Real data adds $s$ real successes and $n-s$ real failures. The posterior is Beta with shape parameters equal to total successes plus pseudo-successes plus one (and the same for failures).
Proof Sketch
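The Bernoulli likelihood is
$$L(\theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{s}(1-\theta)^{n-s}.$$
The Beta prior density is proportional to $\theta^{\alpha-1}(1-\theta)^{\beta-1}$. The posterior is proportional to the product:
$$\theta^{\alpha+s-1}(1-\theta)^{\beta+n-s-1},$$
which is the kernel of $\mathrm{Beta}(\alpha+s, \beta+n-s)$.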
Why It Matters
Posterior mean is $(\alpha+s)/(\alpha+\beta+n)$, a precision-weighted blend of the prior mean $\alpha/(\alpha+\beta)$ and the MLE $s/n$. The blend weight on the MLE is $n/(\alpha+\beta+n)$, which approaches one as data accumulates. The Jeffreys prior for the Bernoulli is $\mathrm{Beta}(1/2, 1/2)$; the uniform prior is $\mathrm{Beta}(1, 1)$; the Haldane prior is $\mathrm{Beta}(0, 0)$ (improper). All three are admissible starting points with different bias-variance trade-offs.
Failure Mode
Conjugacy is a property of the likelihood-prior pair. With i.i.d. Bernoulli or Binomial sampling and a proper Beta prior, the posterior is Beta. If the trials are not conditionally independent given $\theta$, if each observation has its own probability $\theta_i$, or if the likelihood is a regression model such as logistic or probit regression, the Beta update no longer applies. The same failure occurs when the prior is improper without a proper posterior: for example, the Haldane prior with all successes or all failures leaves one posterior shape at zero.
The conjugacy calculation is short because the Bernoulli likelihood and the Beta density use the same sufficient statistic. The data enter only through $s$ and $n-s$. This is not an accident: the Bernoulli family is a one-parameter exponential family, and the Beta prior is the conjugate prior obtained by matching the powers of $\theta$ and $1-\theta$. Once the likelihood has extra structure, such as covariates in logistic regression, the sufficient statistic is no longer just a pair of counts. The posterior cannot stay inside the two-shape Beta family because two numbers are not enough to store the information in the design matrix.
Posterior predictive probability
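Conjugacy also gives the next-trial predictive probability without numerical integration. If $\theta \mid \text{data} \sim \mathrm{Beta}(\alpha+s, \beta+n-s)$, then
$$P(X_{n+1} = 1 \mid \text{data}) = E[\theta \mid \text{data}] = \frac{\alpha+s}{\alpha+\beta+n}.$$
For the earlier posterior-mean example, with a $\mathrm{Beta}(\alpha, \beta)$ prior and $s$ successes in $n$ trials, the predictive probability of success on the next trial is $(\alpha+s)/(\alpha+\beta+n)$. The plug-in MLE would use $s/n$; the prior-only prediction would use $\alpha/(\alpha+\beta)$. The posterior predictive value is between them because it averages over posterior uncertainty in $\theta$ instead of pretending that either the prior mean or the empirical frequency is exact.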
For $m$ future Bernoulli trials, the predictive distribution of the future success count is Beta-Binomial, not Binomial with $\theta$ fixed at the posterior mean. The distinction is variance: integrating over $\theta$ adds uncertainty. If a model predicts the right average number of future successes but systematically understates the spread, the missing term is often this posterior uncertainty.
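The variance gap between the Beta-Binomial predictive and the plug-in Binomial can be made concrete. The sketch below assumes a hypothetical posterior $\mathrm{Beta}(9, 5)$ and $m = 20$ future trials; the pmf used is the standard Beta-Binomial form $\binom{m}{j} B(a+j, b+m-j)/B(a, b)$, evaluated in log space.

```python
import math

def log_beta_fn(a, b):
    """log B(a, b) via log-Gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(j, m, a, b):
    """P(J = j) for J successes in m future trials when theta ~ Beta(a, b)."""
    log_choose = math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
    return math.exp(log_choose + log_beta_fn(a + j, b + m - j) - log_beta_fn(a, b))

def pmf_mean_var(pmf):
    """Mean and variance of a pmf given as a list indexed by the count."""
    mean = sum(j * p for j, p in enumerate(pmf))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(pmf))
    return mean, var

# Hypothetical posterior Beta(9, 5) and m = 20 future trials.
a, b, m = 9.0, 5.0, 20
pmf = [beta_binomial_pmf(j, m, a, b) for j in range(m + 1)]
bb_mean, bb_var = pmf_mean_var(pmf)
p_hat = a / (a + b)
plug_in_var = m * p_hat * (1.0 - p_hat)  # Binomial with theta fixed at p_hat
```

The two predictives share the same mean $m\,a/(a+b)$, while the Beta-Binomial variance is larger by the factor $(a+b+m)/(a+b+1)$.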
Connection to the Gamma Distribution
The Beta arises naturally from ratios of independent Gammas. If $X \sim \mathrm{Gamma}(\alpha, \lambda)$ and $Y \sim \mathrm{Gamma}(\beta, \lambda)$ are independent with a common rate $\lambda$, then
$$\frac{X}{X+Y} \sim \mathrm{Beta}(\alpha, \beta),$$
and this ratio is independent of $X+Y$. This identity is the reason the Beta function has the form it does: it is the Jacobian-adjusted ratio of Gamma normalizing constants. See gamma distribution for the Gamma side.
The same identity gives a simple way to sample from the Beta: draw $X$ and $Y$ from independent Gammas (which are sums of independent Exponentials when shapes are integers) and compute the ratio.
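The sampling recipe can be sketched with Python's `random.gammavariate`, whose two arguments are the shape and the scale. The shapes $(2, 5)$, the seed, and the function name are illustrative assumptions.

```python
import random

def sample_beta_via_gammas(alpha, beta, rng):
    """Draw one Beta(alpha, beta) variate as X / (X + Y) with independent Gammas."""
    x = rng.gammavariate(alpha, 1.0)  # shape alpha, scale 1
    y = rng.gammavariate(beta, 1.0)   # shape beta, same scale
    return x / (x + y)

rng = random.Random(0)
draws = [sample_beta_via_gammas(2.0, 5.0, rng) for _ in range(50_000)]
sample_mean = sum(draws) / len(draws)  # should be close to 2 / 7
```

Any common scale works, because the scale cancels in the ratio.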
Maximum Likelihood Estimation
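The MLE of $(\alpha, \beta)$ from an i.i.d. Beta sample $x_1, \dots, x_n$ has no closed form. The log-likelihood is
$$\ell(\alpha, \beta) = n\left[\log\Gamma(\alpha+\beta) - \log\Gamma(\alpha) - \log\Gamma(\beta)\right] + (\alpha-1)\sum_{i=1}^n \log x_i + (\beta-1)\sum_{i=1}^n \log(1-x_i),$$
and the score equations are
$$\psi(\alpha) - \psi(\alpha+\beta) = \overline{\log x}, \qquad \psi(\beta) - \psi(\alpha+\beta) = \overline{\log(1-x)},$$
where $\psi$ is the digamma function and the overlines denote sample averages. These must be solved numerically (e.g., by Newton's method, with a starting point from the method-of-moments estimator). The Fisher information matrix per observation is
$$I(\alpha, \beta) = \begin{pmatrix} \psi'(\alpha) - \psi'(\alpha+\beta) & -\psi'(\alpha+\beta) \\ -\psi'(\alpha+\beta) & \psi'(\beta) - \psi'(\alpha+\beta) \end{pmatrix},$$
where $\psi'$ is the trigamma function.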
In Bayesian workflows the MLE is rarely used; the posterior is typically the object of interest, and the Beta-Bernoulli conjugacy gives the posterior in closed form for any choice of Beta prior.
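The Newton iteration for the score equations can be sketched without external libraries. Everything below is an illustrative assumption rather than a production routine: `digamma` and `trigamma` are hand-rolled from the recurrence plus a truncated asymptotic series (accurate to roughly $10^{-8}$), the start is the method-of-moments estimate, and the step is halved whenever it would leave the positive quadrant. The test data are simulated from a hypothetical $\mathrm{Beta}(2, 5)$.

```python
import math
import random

def digamma(x):
    """psi(x) for x > 0: upward recurrence, then a truncated asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """psi'(x) for x > 0, using the same recurrence-plus-series strategy."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + 1.0 / x + 0.5 * inv2 + (1 / 6 - inv2 * (1 / 30 - inv2 / 42)) / x ** 3

def fit_beta_mle(data, tol=1e-10, max_iter=100):
    """Newton's method on the per-observation Beta score; data strictly in (0, 1)."""
    n = len(data)
    mean_log_x = sum(math.log(x) for x in data) / n
    mean_log_1mx = sum(math.log1p(-x) for x in data) / n
    # Method-of-moments starting point (1/n variance).
    m = sum(data) / n
    v = sum((x - m) ** 2 for x in data) / n
    kappa = m * (1.0 - m) / v - 1.0
    if kappa <= 0.0:
        raise ValueError("sample variance too large for a Beta starting point")
    a, b = m * kappa, (1.0 - m) * kappa
    for _ in range(max_iter):
        g1 = digamma(a + b) - digamma(a) + mean_log_x      # d/d alpha
        g2 = digamma(a + b) - digamma(b) + mean_log_1mx    # d/d beta
        t = trigamma(a + b)
        h11, h22, h12 = t - trigamma(a), t - trigamma(b), t  # Hessian entries
        det = h11 * h22 - h12 * h12
        da = (h22 * g1 - h12 * g2) / det   # (da, db) solves H d = g
        db = (h11 * g2 - h12 * g1) / det
        step = 1.0
        while a - step * da <= 0.0 or b - step * db <= 0.0:
            step *= 0.5                    # keep both shapes positive
        a, b = a - step * da, b - step * db
        if abs(da) + abs(db) < tol:
            break
    return a, b

rng = random.Random(1)
data = [rng.betavariate(2.0, 5.0) for _ in range(5000)]
a_hat, b_hat = fit_beta_mle(data)
```

Because the Beta family is an exponential family in $(\alpha, \beta)$, the log-likelihood is concave in the shapes, which is why a plain Newton step from a reasonable start usually suffices.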
Method of Moments (Closed Form)
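The method-of-moments estimator has a closed form. With $\bar{x}$ the sample mean and $v$ the sample variance,
$$\hat\alpha = \bar{x}\left(\frac{\bar{x}(1-\bar{x})}{v} - 1\right), \qquad \hat\beta = (1-\bar{x})\left(\frac{\bar{x}(1-\bar{x})}{v} - 1\right).$$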
The estimator is consistent and almost always reasonable as a starting point for MLE iteration. See method of moments for the general framework.
The method-of-moments formula also names a boundary condition. A Beta distribution with mean $\mu$ must satisfy $\mathrm{Var}(X) < \mu(1-\mu)$. If the empirical variance violates that inequality, the plug-in expression $\bar{x}(1-\bar{x})/v - 1$ (with $\bar{x}$ the sample mean and $v$ the sample variance) is nonpositive, so at least one shape estimate is invalid. This can happen with data rounded to endpoints, with a misspecified sample that is not supported on $(0,1)$, or with a mixture that is too dispersed for a single Beta component. The failure is diagnostic: the problem is not Newton iteration; the one-component Beta model is the wrong summary for the sample.
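A minimal sketch of the method-of-moments estimator with the boundary check made explicit. The function name and the choice to raise `ValueError` on an invalid fit are assumptions made here; the variance uses the $1/n$ convention via `statistics.pvariance`.

```python
import statistics

def beta_method_of_moments(data):
    """Return (alpha_hat, beta_hat), or raise if the moments fall outside the Beta family."""
    m = statistics.fmean(data)
    v = statistics.pvariance(data, mu=m)  # 1/n convention
    if not (0.0 < m < 1.0):
        raise ValueError("sample mean must lie strictly inside (0, 1)")
    if v <= 0.0:
        raise ValueError("sample variance must be positive")
    kappa = m * (1.0 - m) / v - 1.0       # estimated concentration alpha + beta
    if kappa <= 0.0:
        raise ValueError("variance is at or above m(1 - m): no single Beta fits")
    return m * kappa, (1.0 - m) * kappa
```

For example, the two-point sample `[0.2, 0.6]` has mean $0.4$ and $1/n$ variance $0.04$, which gives concentration $5$ and shapes $(2, 3)$.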
Diagnostics and Boundary Cases
The Beta family is flexible on $(0,1)$, but it is still a two-parameter model. The first diagnostic is the mean-variance relation. If the mean is $\mu$ and the concentration is $\kappa = \alpha+\beta$, then
$$\mathrm{Var}(X) = \frac{\mu(1-\mu)}{\kappa+1}.$$
For a fixed mean, the largest possible Beta variance is approached as $\kappa \to 0$ and equals $\mu(1-\mu)$ in the limit. Any sample variance above that level cannot come from a single proper Beta distribution with that sample mean. The usual method-of-moments formula is just this relation solved backward, with sample mean $\bar{x}$ and sample variance $v$:
$$\hat\kappa = \frac{\bar{x}(1-\bar{x})}{v} - 1.$$
A negative estimate of $\kappa$ is not a numerical glitch; it says the observed spread is too large for the family.
When a single Beta fit goes degenerate
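Suppose a dataset on $(0,1)$ has sample mean $\bar{x} = 0.5$ and a sample variance $v$ only slightly below $0.25$. The method-of-moments concentration is
$$\hat\kappa = \frac{0.25}{v} - 1,$$
which is close to zero, so $\hat\alpha = \hat\beta = \hat\kappa/2$ is close to zero as well. That is a deep U-shaped Beta with almost all of its mass pressed against $0$ and $1$. A fit this close to the ceiling is a warning, not a model: the data is nearly as dispersed as a two-point mass, and a single smooth Beta density describes it poorly.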
The ceiling is sharp, and it bounds the data itself, not just the Beta family. A reported sample variance above $0.25$ at mean $0.5$ would force $\hat\kappa < 0$, i.e. negative shape parameters, which no Beta admits. For values genuinely confined to $[0,1]$ that case cannot arise: Popoviciu caps the variance at $(1-0)^2/4$ and Bhatia-Davis at $\bar{x}(1-\bar{x})$, both equal to $0.25$ here. These ceilings bound the population ($1/n$) variance; the unbiased ($1/(n-1)$) estimator can exceed $0.25$ on data genuinely inside $[0,1]$, so read $v$ as the $1/n$ form throughout. Seeing $v > 0.25$ under that convention therefore means the values are not actually in $[0,1]$ or $v$ was mis-estimated. When the sample piles up at exact zeros and ones, use a zero-one-inflated model, a mixture model, or a data-generating model for the counts that produced the proportions.
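The difference between the two variance conventions is easy to see on a toy sample. The four-point dataset below is hypothetical, chosen because it sits exactly on the ceiling; the function name is an assumption.

```python
import statistics

def variance_ceiling_report(data):
    """Compare the 1/n and 1/(n-1) variances with the ceiling m(1 - m) on [0, 1]."""
    m = statistics.fmean(data)
    return {
        "mean": m,
        "ceiling": m * (1.0 - m),                    # Bhatia-Davis bound on [0, 1]
        "var_1_over_n": statistics.pvariance(data),  # population convention
        "var_unbiased": statistics.variance(data),   # 1/(n-1) convention
    }

# Hypothetical sample made only of exact zeros and ones.
report = variance_ceiling_report([0.0, 0.0, 1.0, 1.0])
```

Here the $1/n$ variance equals the ceiling, while the $1/(n-1)$ estimator lands above it even though every value lies in $[0, 1]$.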
Endpoint observations need separate handling. The Beta density is defined on the open interval $(0,1)$, so $\log x$ and $\log(1-x)$ in the likelihood are finite only when every observation lies strictly between zero and one. Exact zeros and ones are common when the data are empirical rates: a small school may have zero failures in a year; a small experiment may have all successes. Those observations are not illegal data, but they do not fit the continuous Beta likelihood without a measurement model. The usual fixes are explicit: jittering is a preprocessing convention, not a likelihood; adding counts is a Binomial model; adding point masses at zero and one is a zero-one-inflated Beta model.
Exact endpoints versus small neighborhoods
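Take $X \sim \mathrm{Beta}(\alpha, \beta)$ with $\alpha < 1$ and $\beta < 1$. The density is
$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}.$$
It diverges as $x \to 0$ and as $x \to 1$, but $P(X = 0) = P(X = 1) = 0$. A simulation will produce many values very close to the endpoints and no exact endpoint unless the random-number generator rounds.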
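As one concrete instance, assumed here for illustration, take $\mathrm{Beta}(1/2, 1/2)$, the arcsine law, whose CDF has the closed form $F(x) = (2/\pi)\arcsin\sqrt{x}$. The sketch compares the probability of small neighborhoods of the endpoints with a seeded simulation; the neighborhood width `eps` and the seed are arbitrary choices.

```python
import math
import random

def arcsine_cdf(x):
    """CDF of Beta(1/2, 1/2): F(x) = (2 / pi) * asin(sqrt(x)) for x in [0, 1]."""
    return (2.0 / math.pi) * math.asin(math.sqrt(x))

eps = 0.01
mass_near_zero = arcsine_cdf(eps)              # P(X < eps)
mass_near_one = 1.0 - arcsine_cdf(1.0 - eps)   # P(X > 1 - eps)

rng = random.Random(0)
draws = [rng.betavariate(0.5, 0.5) for _ in range(20_000)]
frac_near_ends = sum(d < eps or d > 1.0 - eps for d in draws) / len(draws)
```

Each endpoint neighborhood of width $0.01$ carries several times the mass a uniform would give it, yet the probability of hitting an endpoint exactly is still zero.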
That distinction changes modeling decisions. If observed records contain literal zeros because a rate was computed as $0/n$, the underlying Bernoulli probability might still be inside $(0,1)$ and the count model is appropriate. If observed records contain structural zeros, such as a category that cannot occur for a subgroup, the Beta model is wrong because the atom at zero is part of the phenomenon.
The second diagnostic is whether the data-generating story is exchangeable. Beta-Bernoulli conjugacy assumes one latent $\theta$ for all trials. If the first ten observations come from one population and the next ten from another, a single posterior shape pair hides heterogeneity by averaging it away. In that case a hierarchical model is the natural extension: each group $g$ has its own $\theta_g$, and the group probabilities share a higher-level distribution. The ordinary Beta update is still useful locally, but it is no longer the whole model.
Choosing a Beta Prior
The practical prior-choice question is not "which Beta is neutral?" but "what mean and concentration are defensible before seeing these data?" The mean $m$ gives the center. The concentration $\kappa$ gives the prior sample-size scale. Once those are chosen,
$$\alpha = m\kappa, \qquad \beta = (1-m)\kappa.$$
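For example, suppose a baseline conversion rate is near $m$, but the new experiment is different enough that the prior should be weak. Taking a small concentration $\kappa$ gives $\mathrm{Beta}(m\kappa, (1-m)\kappa)$. The prior mean is $m$, but after $n$ visitors it receives only $\kappa/(\kappa+n)$ of the posterior-mean weight. If the old experiments are closely matched to the new one, a much larger $\kappa$ would encode a much stronger prior, giving the same prior mean $m$ with a correspondingly larger weight $\kappa/(\kappa+n)$.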
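The mean-and-concentration parameterization translates directly into code. The numbers in the usage lines (baseline $0.1$, concentrations $10$ and $1000$, $90$ visitors) are hypothetical and serve only to show the two regimes; the function names are assumptions.

```python
def shapes_from_mean_concentration(mean, kappa):
    """Map (mean, concentration) to Beta shapes (alpha, beta)."""
    if not (0.0 < mean < 1.0):
        raise ValueError("mean must lie strictly inside (0, 1)")
    if kappa <= 0.0:
        raise ValueError("concentration must be positive")
    return mean * kappa, (1.0 - mean) * kappa

def prior_weight(kappa, n):
    """Share of the posterior mean contributed by the prior mean after n trials."""
    return kappa / (kappa + n)

# Hypothetical weak and strong priors around a 10% baseline, 90 visitors.
weak_shapes = shapes_from_mean_concentration(0.1, 10.0)
strong_shapes = shapes_from_mean_concentration(0.1, 1000.0)
weak_weight = prior_weight(10.0, 90)
strong_weight = prior_weight(1000.0, 90)
```

Both priors share the same mean; only the weight the prior retains after the data differs.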
This parameterization also exposes bad defaults. A uniform prior $\mathrm{Beta}(1, 1)$ has mean $1/2$. For rare-event problems, that center is far from plausible, though the concentration is weak. Jeffreys' prior $\mathrm{Beta}(1/2, 1/2)$ is invariant under smooth reparameterization and puts more density near the endpoints. It is often a better default for pure Bernoulli inference, but it is not a substitute for domain knowledge when base rates are known.
Prior sensitivity should be reported on the scale readers care about. Do not only list $\alpha$ and $\beta$. Report the prior mean, concentration, posterior mean after the observed counts, and predictive probability for the next trial. If two plausible priors lead to the same decision after the data, the inference is insensitive to that prior choice. If they do not, the page should say that the data are not yet strong enough to dominate the prior.
Common Confusions
Pseudo-counts versus real counts
A $\mathrm{Beta}(\alpha_0, \beta_0)$ prior is sometimes described as "equivalent to having seen $\alpha_0$ successes and $\beta_0$ failures". This intuition is right in expectation but wrong in variance: the prior also has a "concentration" that real data cannot match unless the data sample is at least that large. Conjugate priors compress prior information into a sufficient statistic, but they do not replace it with imaginary observations.
Beta(1, 1) is the uniform distribution
The Beta(1, 1) density is $f(x) = 1$ for $x \in (0,1)$, which is the Uniform(0, 1) density. The Uniform is a special Beta, and the Beta family is a two-parameter generalization of the Uniform that departs from it whenever the shape parameters are not both one. Some Bayesian software libraries default to a Beta(1, 1) prior; this is the noninformative-uniform-prior choice, not the Jeffreys prior.
The mode is not the mean
For $\alpha, \beta > 1$ the mode is $(\alpha-1)/(\alpha+\beta-2)$ and the mean is $\alpha/(\alpha+\beta)$. They coincide only when $\alpha = \beta$ (symmetric Beta). Bayesian MAP estimates report the mode; posterior-mean estimates report the mean. The two disagree under asymmetric priors and small sample sizes.
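The gap between the two point estimates is easy to compute. The posterior shapes in the usage lines are hypothetical, and the function names are assumptions made for this sketch.

```python
def beta_mode(a, b):
    """Interior mode (a - 1) / (a + b - 2); defined only when both shapes exceed one."""
    if a <= 1.0 or b <= 1.0:
        raise ValueError("interior mode requires a > 1 and b > 1")
    return (a - 1.0) / (a + b - 2.0)

def beta_mean(a, b):
    """Mean a / (a + b)."""
    return a / (a + b)

# Hypothetical asymmetric posterior Beta(3, 9) versus a symmetric Beta(4, 4).
map_estimate = beta_mode(3.0, 9.0)
mean_estimate = beta_mean(3.0, 9.0)
symmetric_gap = beta_mode(4.0, 4.0) - beta_mean(4.0, 4.0)
```

For the asymmetric shapes the MAP estimate sits below the posterior mean; for the symmetric shapes the gap vanishes.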
A density spike is not endpoint mass
When $\alpha < 1$, the Beta density diverges at $0$; when $\beta < 1$, it diverges at $1$. That does not mean $P(X = 0) > 0$ or $P(X = 1) > 0$. A proper Beta random variable is continuous on $(0,1)$ and assigns probability zero to each single point. The spike means small neighborhoods near the endpoint receive large probability relative to their width, not that the distribution has an atom at the endpoint. This distinction matters in Bayesian updates: a prior can put heavy density near $0$ and $1$ without declaring either value certain.
Exercises
Problem
Let $X \sim \mathrm{Beta}(\alpha, \beta)$ with $\alpha, \beta > 1$. Compute $E[X]$, $\mathrm{Var}(X)$, and the mode.
Problem
An A/B test starts with a Beta(1, 1) prior on the conversion probability $\theta$. After serving 200 users, 36 convert. Find the posterior distribution and the posterior mean, and compare to the MLE.
Problem
Let $U_1, \dots, U_n$ be i.i.d. $\mathrm{Uniform}(0,1)$. Identify the distribution of the median $U_{((n+1)/2)}$ (with $n$ odd) and the third-largest value $U_{(n-2)}$.
Problem
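Let $X \sim \mathrm{Gamma}(\alpha, \lambda)$ and $Y \sim \mathrm{Gamma}(\beta, \lambda)$ be independent. Show that $X/(X+Y) \sim \mathrm{Beta}(\alpha, \beta)$, independent of $X+Y$.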
References
Canonical:
- Casella and Berger, Statistical Inference (2002), Chapter 3 (Section 3.3 on Beta), Chapter 5 (Section 5.4 on order statistics).
- Lehmann and Casella, Theory of Point Estimation (1998), Chapter 4 (conjugate priors).
- Bickel and Doksum, Mathematical Statistics, Volume I (2015), Chapter 1 (conjugate families).
Bayesian framing:
- Gelman et al., Bayesian Data Analysis (2013), Chapter 2 (Section 2.4 on the Beta-Binomial model).
- Robert, The Bayesian Choice (2007), Chapter 3.
- Jaynes, Probability Theory: The Logic of Science (2003), Chapter 6 (uniform and Jeffreys priors for the Bernoulli).
Order statistics:
- David and Nagaraja, Order Statistics (2003), Chapter 2 (uniform order statistics and the Beta distribution).
Last reviewed: May 11, 2026