
Foundations

Joint, Marginal, and Conditional Distributions

Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.


Why This Matters

[Figure: joint distribution over X and Y, with the marginal densities p(X) and p(Y) shown along the axes and a highlighted conditional slice p(Y | X in [0.8, 1.2])]

ML is about modeling relationships between variables. A classifier models P(Y|X). A generative model learns P(X) or P(X, Y). Bayesian inference uses Bayes theorem to compute posteriors. Every probabilistic model involves joint, marginal, or conditional distributions.

Core Definitions

Definition

Joint Distribution

For discrete random variables X, Y: the joint PMF is p(x, y) = P(X = x, Y = y). For continuous random variables (see common probability distributions for standard families): the joint PDF f(x, y) satisfies P((X, Y) \in A) = \iint_A f(x, y) \, dx \, dy. Both must be non-negative and sum/integrate to 1.

Definition

Marginal Distribution

The marginal of X is obtained by summing or integrating out Y:

Discrete: p_X(x) = \sum_y p(x, y)

Continuous: f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy

Marginalization discards information about Y.
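As a quick numerical sketch (a hypothetical 2x2 joint table, using numpy), marginalization is just summing the joint over the axis being discarded:

```python
import numpy as np

# Hypothetical joint PMF: rows index x, columns index y; entries sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# p_X(x) = sum_y p(x, y): sum out Y (the columns).
p_x = p_xy.sum(axis=1)
# p_Y(y) = sum_x p(x, y): sum out X (the rows).
p_y = p_xy.sum(axis=0)
```

Here p_x comes out to [0.3, 0.7] and p_y to [0.4, 0.6]; both are valid distributions, but neither retains any information about the other variable.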

Definition

Conditional Distribution

The conditional distribution of Y given X = x is:

Discrete: p(y|x) = \frac{p(x, y)}{p_X(x)} for p_X(x) > 0

Continuous: f(y|x) = \frac{f(x, y)}{f_X(x)} for f_X(x) > 0

This defines a valid distribution over Y for each fixed x.
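In table form, conditioning is row-wise normalization of the joint by the marginal (a sketch on a hypothetical joint table):

```python
import numpy as np

# Hypothetical joint PMF: rows index x, columns index y.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])
p_x = p_xy.sum(axis=1)            # marginal p_X(x)

# p(y|x) = p(x, y) / p_X(x): divide each row by its marginal (broadcasting).
p_y_given_x = p_xy / p_x[:, None]
```

Every row of p_y_given_x sums to 1, i.e. each fixed x yields a valid distribution over Y.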

Definition

Independence

Random variables X and Y are independent (written X \perp Y) if and only if:

p(x, y) = p_X(x) \, p_Y(y) \quad \text{for all } x, y

Equivalently, p(y|x) = p_Y(y) for all x: knowing X tells you nothing about Y.
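The factorization can be checked numerically (a sketch on hypothetical joint tables; the tolerance is arbitrary):

```python
import numpy as np

def is_independent(p_xy, tol=1e-9):
    """True if the joint PMF equals the outer product of its marginals."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return np.allclose(p_xy, np.outer(p_x, p_y), atol=tol)

indep = np.outer([0.3, 0.7], [0.4, 0.6])   # built to factorize exactly
dep = np.array([[0.5, 0.0],                # all mass on the diagonal:
                [0.0, 0.5]])               # knowing X pins down Y
```

is_independent returns True for the first table and False for the second.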

Definition

Conditional Independence

X and Y are conditionally independent given Z (written X \perp Y \mid Z) if:

p(x, y | z) = p(x|z) \, p(y|z) \quad \text{for all } x, y, z

Conditional independence is the foundation of graphical models and the naive Bayes assumption. For modeling complex dependence structures beyond independence, copulas separate the marginal behavior from the joint dependence.

Main Theorems

Theorem

Bayes Theorem

Statement

P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y) \, P(Y = y)}{P(X = x)}

In density form:

f(y|x) = \frac{f(x|y) \, f_Y(y)}{f_X(x)}

where f_X(x) = \int f(x|y) f_Y(y) \, dy is the marginal (the "evidence").

Intuition

Bayes theorem inverts a conditional: it converts P(X|Y) (the likelihood) into P(Y|X) (the posterior) by incorporating the prior P(Y). The denominator normalizes the result.
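In the discrete case the whole theorem is a few lines (a minimal sketch; the function name and two-hypothesis example are hypothetical):

```python
def bayes_posterior(prior, likelihood):
    """P(Y=y | X=x) for each y, given prior[y] = P(Y=y) and
    likelihood[y] = P(X=x | Y=y) for the observed x."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)          # P(X=x), by the law of total probability
    return [j / evidence for j in joint]

# Two equally likely hypotheses; the data is 3x more likely under the first.
posterior = bayes_posterior(prior=[0.5, 0.5], likelihood=[0.9, 0.3])
```

The posterior comes out to roughly [0.75, 0.25]: the 3:1 likelihood ratio shifts a 50/50 prior to 3:1 odds.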

Proof Sketch

Start from the definition: P(Y|X) = P(X,Y)/P(X) and P(X|Y) = P(X,Y)/P(Y). From the second equation, P(X,Y) = P(X|Y)P(Y). Substituting into the first gives Bayes theorem.

Why It Matters

Bayes theorem is the foundation of Bayesian inference. The prior encodes beliefs before seeing data. The likelihood connects data to parameters. The posterior combines both. In ML: Bayesian neural networks, Gaussian processes, and MAP estimation all rest on this formula.

Failure Mode

The denominator P(X = x) (or f_X(x)) is often intractable to compute because it requires integrating over all possible Y. This is why approximate inference (MCMC, variational inference) is needed in practice.

Chain Rule and Law of Total Probability

Chain rule of probability for n random variables:

P(X_1, X_2, \ldots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})

This is not an assumption; it follows directly from the definition of conditional probability applied repeatedly.
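The identity can be verified numerically: factor an arbitrary joint into its chain-rule conditionals and recompose it (a sketch using numpy; the random joint over three binary variables is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()                                     # arbitrary joint P(X1, X2, X3)

p1 = p.sum(axis=(1, 2))                          # P(X1)
p2_given_1 = p.sum(axis=2) / p1[:, None]         # P(X2 | X1)
p3_given_12 = p / p.sum(axis=2, keepdims=True)   # P(X3 | X1, X2)

# P(x1) P(x2|x1) P(x3|x1,x2) recovers the joint exactly.
recomposed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
```

The recomposed array matches p to floating-point precision, for any joint: no modeling assumption is involved.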

Law of total probability (discrete version):

P(X = x) = \sum_y P(X = x \mid Y = y) \, P(Y = y)

This "averages out" the conditioning variable. It is used to compute marginals from conditionals and appears throughout mixture model derivations. These operations connect directly to computing expectations and moments.
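As a numeric sketch (the prior and conditionals are hypothetical):

```python
# P(X = x) = sum_y P(X = x | Y = y) P(Y = y)
p_y = [0.25, 0.75]                     # P(Y)
p_x_given_y = [[0.6, 0.4],             # P(X | Y = 0)
               [0.1, 0.9]]             # P(X | Y = 1)

# Average the conditionals, weighted by how likely each Y value is.
p_x = [sum(p_x_given_y[y][x] * p_y[y] for y in range(2)) for x in range(2)]
# p_x[0] = 0.6 * 0.25 + 0.1 * 0.75 = 0.225
```

This is exactly the mixture-model computation: the marginal of X is a weighted average of the per-component conditionals.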

Common Confusions

Watch Out

Marginal independence does not imply conditional independence

X and Y can be marginally independent (X \perp Y) but conditionally dependent (X \not\perp Y \mid Z). The classic example: two independent causes X, Y of a common effect Z. Once you observe Z, learning about X tells you about Y (explaining away).
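Explaining away can be verified by exact enumeration (a hypothetical setup: two independent fair-coin causes and effect Z = X OR Y):

```python
from itertools import product

def prob(pred):
    """Total probability of outcomes (x, y, z) satisfying pred, with
    X, Y independent Bernoulli(0.5) and Z = X OR Y."""
    return sum(0.25 for x, y in product([0, 1], repeat=2) if pred(x, y, x | y))

# Conditioned on the effect Z = 1, the other cause becomes informative:
p_x1_given_z = prob(lambda x, y, z: x and z) / prob(lambda x, y, z: z)    # 2/3
p_x1_given_zy = (prob(lambda x, y, z: x and z and y)
                 / prob(lambda x, y, z: z and y))                         # 1/2
```

Marginally P(X=1) = P(X=1 | Y=1) = 0.5, but given Z=1, learning Y=1 drops P(X=1) from 2/3 to 1/2: Y=1 already explains the effect.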

Watch Out

Conditional independence does not imply marginal independence

The converse also fails. X \perp Y \mid Z does not imply X \perp Y. Example: Z \sim \text{Bernoulli}(0.5), X = Z, Y = Z. Then X \perp Y \mid Z (both are deterministic given Z), but X and Y are perfectly dependent marginally.
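The copy-the-coin example can be checked directly (a small enumeration sketch of that exact construction):

```python
# Z ~ Bernoulli(0.5), X = Z, Y = Z: only two outcomes, each with probability 0.5.
outcomes = [(z, z, z) for z in (0, 1)]          # (x, y, z)

def prob(pred):
    return sum(0.5 for x, y, z in outcomes if pred(x, y, z))

# Given Z = 1 both variables are deterministic, so p(x,y|z) = p(x|z) p(y|z):
lhs = prob(lambda x, y, z: x and y and z) / prob(lambda x, y, z: z)    # 1
rhs = ((prob(lambda x, y, z: x and z) / prob(lambda x, y, z: z))
       * (prob(lambda x, y, z: y and z) / prob(lambda x, y, z: z)))    # 1 * 1

# Marginally, X and Y are perfectly dependent:
p_xy = prob(lambda x, y, z: x and y)    # 0.5, but p_X(1) p_Y(1) = 0.25
```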

Watch Out

Conditioning on zero-probability events

For continuous random variables, P(X = x) = 0 for any specific x. The conditional density f(y|x) is defined as a ratio of densities, not as P(Y = y | X = x). Rigorous treatment uses the Radon-Nikodym derivative or disintegration of measures.

Canonical Examples

Example

Naive Bayes classifier

Naive Bayes assumes features are conditionally independent given the class: P(X_1, \ldots, X_d \mid Y) = \prod_{i=1}^d P(X_i \mid Y). By Bayes theorem, P(Y \mid X) \propto P(Y) \prod_i P(X_i \mid Y). This reduces the number of parameters from exponential in d to linear in d. The independence assumption is almost always false, but the classifier often works well because the decision boundary can still be correct even when the probability estimates are wrong.
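The factorization can be sketched as a tiny Bernoulli naive Bayes (a hypothetical toy implementation with Laplace smoothing, not a production classifier):

```python
import math

def fit(X, y, alpha=1.0):
    """Estimate P(Y) and P(X_i = 1 | Y) from binary data, with Laplace smoothing."""
    classes = sorted(set(y))
    d = len(X[0])
    prior = {c: y.count(c) / len(y) for c in classes}
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Smoothed estimate of P(X_i = 1 | Y = c).
        cond[c] = [(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                   for i in range(d)]
    return prior, cond

def predict(x, prior, cond):
    """argmax_y  log P(y) + sum_i log P(x_i | y)  -- the naive Bayes rule."""
    def score(c):
        s = math.log(prior[c])
        for xi, p in zip(x, cond[c]):
            s += math.log(p if xi else 1 - p)
        return s
    return max(prior, key=score)
```

On four toy examples where the first feature tracks the label, fit([[1, 1], [1, 0], [0, 1], [0, 0]], [1, 1, 0, 0]) yields a classifier predicting 1 for [1, 1] and 0 for [0, 0]. Note the sums of logs: the conditional-independence assumption turns a d-dimensional joint into d one-dimensional factors.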

Exercises

ExerciseCore

Problem

A medical test has sensitivity P(positive | disease) = 0.95 and specificity P(negative | no disease) = 0.99. The disease prevalence is P(disease) = 0.001. Compute P(disease | positive).

ExerciseAdvanced

Problem

Let (X, Y) have joint density f(x, y) = 2 for 0 \leq x \leq y \leq 1 and f(x, y) = 0 otherwise. Find the marginal densities f_X(x) and f_Y(y), and the conditional density f(y|x).

References

Canonical:

  • Grimmett & Stirzaker, Probability and Random Processes (2020), Chapters 1-3
  • Casella & Berger, Statistical Inference (2002), Chapters 1-2

For ML context:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-2
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 2

Last reviewed: April 2026
