
Foundations

Joint, Marginal, and Conditional Distributions

Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.


Why This Matters

[Figure: joint distribution over X and Y, with the marginal densities p(X) and p(Y) shown along the axes and a highlighted conditional slice p(Y | X in [0.8, 1.2])]

ML is about modeling relationships between variables. A classifier models P(Y|X). A generative model learns P(X) or P(X, Y). Bayesian inference uses Bayes theorem to compute posteriors. Every probabilistic model involves joint, marginal, or conditional distributions.

Core Definitions

Definition

Joint Distribution

For discrete random variables X, Y: the joint PMF is p(x, y) = P(X = x, Y = y). For continuous random variables (see common probability distributions for standard families): the joint PDF f(x, y) satisfies P((X, Y) \in A) = \iint_A f(x, y) \, dx \, dy. Both must be non-negative and sum/integrate to 1.

Definition

Marginal Distribution

The marginal of X is obtained by summing or integrating out Y:

Discrete: p_X(x) = \sum_y p(x, y)

Continuous: f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy

Marginalization discards information about Y.
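As a quick numerical sketch (a hypothetical 2x2 joint table, using numpy), marginalization is just summing the joint over the axis being discarded:

```python
import numpy as np

# Hypothetical joint PMF: rows index x, columns index y; entries sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# p_X(x) = sum_y p(x, y): sum out Y (the columns).
p_x = p_xy.sum(axis=1)
# p_Y(y) = sum_x p(x, y): sum out X (the rows).
p_y = p_xy.sum(axis=0)
```

Here p_x comes out to [0.3, 0.7] and p_y to [0.4, 0.6]; both are valid distributions, but neither retains any information about the other variable.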

Definition

Conditional Distribution

The conditional distribution of Y given X = x is:

Discrete: p(y|x) = \frac{p(x, y)}{p_X(x)} for p_X(x) > 0

Continuous: f(y|x) = \frac{f(x, y)}{f_X(x)} for f_X(x) > 0

This defines a valid distribution over Y for each fixed x.
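In table form, conditioning is row-wise normalization of the joint by the marginal (a sketch on a hypothetical joint table):

```python
import numpy as np

# Hypothetical joint PMF: rows index x, columns index y.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])
p_x = p_xy.sum(axis=1)            # marginal p_X(x)

# p(y|x) = p(x, y) / p_X(x): divide each row by its marginal (broadcasting).
p_y_given_x = p_xy / p_x[:, None]
```

Every row of p_y_given_x sums to 1, i.e. each fixed x yields a valid distribution over Y.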

Definition

Independence

Random variables X and Y are independent (written X \perp Y) if and only if:

p(x, y) = p_X(x) \, p_Y(y) \quad \text{for all } x, y

Equivalently, p(y|x) = p_Y(y) for all x: knowing X tells you nothing about Y.
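The factorization can be checked numerically (a sketch on hypothetical joint tables; the tolerance is arbitrary):

```python
import numpy as np

def is_independent(p_xy, tol=1e-9):
    """True if the joint PMF equals the outer product of its marginals."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return np.allclose(p_xy, np.outer(p_x, p_y), atol=tol)

indep = np.outer([0.3, 0.7], [0.4, 0.6])   # built to factorize exactly
dep = np.array([[0.5, 0.0],                # all mass on the diagonal:
                [0.0, 0.5]])               # knowing X pins down Y
```

is_independent returns True for the first table and False for the second.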

Definition

Conditional Independence

X and Y are conditionally independent given Z (written X \perp Y \mid Z) if:

p(x, y | z) = p(x|z) \, p(y|z) \quad \text{for all } x, y, z

Conditional independence is the foundation of graphical models and the naive Bayes assumption. For modeling complex dependence structures beyond independence, copulas separate the marginal behavior from the joint dependence.

Main Theorems

Theorem

Bayes Theorem

Statement

P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y) \, P(Y = y)}{P(X = x)}

In density form:

f(y|x) = \frac{f(x|y) \, f_Y(y)}{f_X(x)}

where f_X(x) = \int f(x|y) f_Y(y) \, dy is the marginal (the "evidence").

Intuition

Bayes theorem inverts a conditional: it converts P(X|Y) (the likelihood) into P(Y|X) (the posterior) by incorporating the prior P(Y). The denominator normalizes the result.
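In the discrete case the whole theorem is a few lines (a minimal sketch; the function name and two-hypothesis example are hypothetical):

```python
def bayes_posterior(prior, likelihood):
    """P(Y=y | X=x) for each y, given prior[y] = P(Y=y) and
    likelihood[y] = P(X=x | Y=y) for the observed x."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)          # P(X=x), by the law of total probability
    return [j / evidence for j in joint]

# Two equally likely hypotheses; the data is 3x more likely under the first.
posterior = bayes_posterior(prior=[0.5, 0.5], likelihood=[0.9, 0.3])
```

The posterior comes out to roughly [0.75, 0.25]: the 3:1 likelihood ratio shifts a 50/50 prior to 3:1 odds.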

Proof Sketch

Start from the definition: P(Y|X) = P(X,Y)/P(X) and P(X|Y) = P(X,Y)/P(Y). From the second equation, P(X,Y) = P(X|Y)P(Y). Substituting into the first gives Bayes theorem.

Why It Matters

Bayes theorem is the foundation of Bayesian inference. The prior encodes beliefs before seeing data. The likelihood connects data to parameters. The posterior combines both. In ML: Bayesian neural networks, Gaussian processes, and MAP estimation all rest on this formula.

Failure Mode

The denominator P(X = x) (or f_X(x)) is often intractable to compute because it requires integrating over all possible Y. This is why approximate inference (MCMC, variational inference) is needed in practice.

Chain Rule and Law of Total Probability

Chain rule of probability for n random variables:

P(X_1, X_2, \ldots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})

This is not an assumption; it follows directly from the definition of conditional probability applied repeatedly.
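The identity can be verified numerically: factor an arbitrary joint into its chain-rule conditionals and recompose it (a sketch using numpy; the random joint over three binary variables is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()                                     # arbitrary joint P(X1, X2, X3)

p1 = p.sum(axis=(1, 2))                          # P(X1)
p2_given_1 = p.sum(axis=2) / p1[:, None]         # P(X2 | X1)
p3_given_12 = p / p.sum(axis=2, keepdims=True)   # P(X3 | X1, X2)

# P(x1) P(x2|x1) P(x3|x1,x2) recovers the joint exactly.
recomposed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
```

The recomposed array matches p to floating-point precision, for any joint: no modeling assumption is involved.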

Law of total probability (discrete version):

P(X = x) = \sum_y P(X = x \mid Y = y) \, P(Y = y)

This "averages out" the conditioning variable. It is used to compute marginals from conditionals and appears throughout mixture model derivations. These operations connect directly to computing expectations and moments.
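As a numeric sketch (the prior and conditionals are hypothetical):

```python
# P(X = x) = sum_y P(X = x | Y = y) P(Y = y)
p_y = [0.25, 0.75]                     # P(Y)
p_x_given_y = [[0.6, 0.4],             # P(X | Y = 0)
               [0.1, 0.9]]             # P(X | Y = 1)

# Average the conditionals, weighted by how likely each Y value is.
p_x = [sum(p_x_given_y[y][x] * p_y[y] for y in range(2)) for x in range(2)]
# p_x[0] = 0.6 * 0.25 + 0.1 * 0.75 = 0.225
```

This is exactly the mixture-model computation: the marginal of X is a weighted average of the per-component conditionals.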

Common Confusions

Watch Out

Marginal independence does not imply conditional independence

X and Y can be marginally independent (X \perp Y) but conditionally dependent (X \not\perp Y \mid Z). The classic example: two independent causes X, Y of a common effect Z. Once you observe Z, learning about X tells you about Y (explaining away).
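Explaining away can be verified by exact enumeration (a hypothetical setup: two independent fair-coin causes and effect Z = X OR Y):

```python
from itertools import product

def prob(pred):
    """Total probability of outcomes (x, y, z) satisfying pred, with
    X, Y independent Bernoulli(0.5) and Z = X OR Y."""
    return sum(0.25 for x, y in product([0, 1], repeat=2) if pred(x, y, x | y))

# Conditioned on the effect Z = 1, the other cause becomes informative:
p_x1_given_z = prob(lambda x, y, z: x and z) / prob(lambda x, y, z: z)    # 2/3
p_x1_given_zy = (prob(lambda x, y, z: x and z and y)
                 / prob(lambda x, y, z: z and y))                         # 1/2
```

Marginally P(X=1) = P(X=1 | Y=1) = 0.5, but given Z=1, learning Y=1 drops P(X=1) from 2/3 to 1/2: Y=1 already explains the effect.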

Watch Out

Conditional independence does not imply marginal independence

The converse also fails. X \perp Y \mid Z does not imply X \perp Y. Example: Z \sim \text{Bernoulli}(0.5), X = Z, Y = Z. Then X \perp Y \mid Z (both are deterministic given Z), but X and Y are perfectly dependent marginally.
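The copy-the-coin example can be checked directly (a small enumeration sketch of that exact construction):

```python
# Z ~ Bernoulli(0.5), X = Z, Y = Z: only two outcomes, each with probability 0.5.
outcomes = [(z, z, z) for z in (0, 1)]          # (x, y, z)

def prob(pred):
    return sum(0.5 for x, y, z in outcomes if pred(x, y, z))

# Given Z = 1 both variables are deterministic, so p(x,y|z) = p(x|z) p(y|z):
lhs = prob(lambda x, y, z: x and y and z) / prob(lambda x, y, z: z)    # 1
rhs = ((prob(lambda x, y, z: x and z) / prob(lambda x, y, z: z))
       * (prob(lambda x, y, z: y and z) / prob(lambda x, y, z: z)))    # 1 * 1

# Marginally, X and Y are perfectly dependent:
p_xy = prob(lambda x, y, z: x and y)    # 0.5, but p_X(1) p_Y(1) = 0.25
```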

Watch Out

Conditioning on zero-probability events

For continuous random variables, P(X = x) = 0 for any specific x. The conditional density f(y|x) is defined as a ratio of densities, not as P(Y = y | X = x). Rigorous treatment uses the Radon-Nikodym derivative or disintegration of measures.

Canonical Examples

Example

Naive Bayes classifier

Naive Bayes assumes features are conditionally independent given the class: P(X_1, \ldots, X_d \mid Y) = \prod_{i=1}^d P(X_i \mid Y). By Bayes theorem, P(Y \mid X) \propto P(Y) \prod_i P(X_i \mid Y). This reduces the number of parameters from exponential in d to linear in d. The independence assumption is almost always false, but the classifier often works well because the decision boundary can still be correct even when the probability estimates are wrong.
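The factorization can be sketched as a tiny Bernoulli naive Bayes (a hypothetical toy implementation with Laplace smoothing, not a production classifier):

```python
import math

def fit(X, y, alpha=1.0):
    """Estimate P(Y) and P(X_i = 1 | Y) from binary data, with Laplace smoothing."""
    classes = sorted(set(y))
    d = len(X[0])
    prior = {c: y.count(c) / len(y) for c in classes}
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Smoothed estimate of P(X_i = 1 | Y = c).
        cond[c] = [(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                   for i in range(d)]
    return prior, cond

def predict(x, prior, cond):
    """argmax_y  log P(y) + sum_i log P(x_i | y)  -- the naive Bayes rule."""
    def score(c):
        s = math.log(prior[c])
        for xi, p in zip(x, cond[c]):
            s += math.log(p if xi else 1 - p)
        return s
    return max(prior, key=score)
```

On four toy examples where the first feature tracks the label, fit([[1, 1], [1, 0], [0, 1], [0, 0]], [1, 1, 0, 0]) yields a classifier predicting 1 for [1, 1] and 0 for [0, 0]. Note the sums of logs: the conditional-independence assumption turns a d-dimensional joint into d one-dimensional factors.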

Exercises

ExerciseCore

Problem

A medical test has sensitivity P(positive | disease) = 0.95 and specificity P(negative | no disease) = 0.99. The disease prevalence is P(disease) = 0.001. Compute P(disease | positive).

ExerciseAdvanced

Problem

Let (X, Y) have joint density f(x, y) = 2 for 0 \leq x \leq y \leq 1 and f(x, y) = 0 otherwise. Find the marginal densities f_X(x) and f_Y(y), and the conditional density f(y|x).

References

Canonical:

  • Grimmett & Stirzaker, Probability and Random Processes (2020), Chapters 1-3
  • Casella & Berger, Statistical Inference (2002), Chapters 1-2

For ML context:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-2
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 2

Last reviewed: April 2026
