Foundations
Joint, Marginal, and Conditional Distributions
Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.
Why This Matters
[Figure: Joint distribution with marginals (axes) and a conditional slice (highlighted)]
ML is about modeling relationships between variables. A classifier models $p(y \mid x)$. A generative model learns $p(x)$ or $p(x, y)$. Bayesian inference uses Bayes theorem to compute posteriors. Every probabilistic model involves joint, marginal, or conditional distributions.
Core Definitions
Joint Distribution
For discrete random variables $X$ and $Y$: the joint PMF is $p_{X,Y}(x, y) = P(X = x, Y = y)$. For continuous random variables (see common probability distributions for standard families): the joint PDF $f_{X,Y}(x, y)$ satisfies $P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy$. Both must be non-negative and sum/integrate to 1.
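As a concrete sketch of these two defining properties, here is a tiny joint PMF stored as a table (the numbers are toy values chosen only for illustration):

```python
import numpy as np

# Joint PMF p(x, y) for X in {0, 1} and Y in {0, 1, 2}:
# rows indexed by x, columns by y. Toy values for illustration.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# The two defining properties of a joint PMF:
assert (joint >= 0).all()            # non-negative everywhere
assert np.isclose(joint.sum(), 1.0)  # sums to 1 over all (x, y) pairs
```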
Marginal Distribution
The marginal of $X$ is obtained by summing or integrating out $Y$:
Discrete: $p_X(x) = \sum_y p_{X,Y}(x, y)$
Continuous: $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$
Marginalization discards information about $Y$.
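In code, marginalizing a discrete joint table is just summing over the axis you want to discard (a minimal sketch using a toy joint PMF):

```python
import numpy as np

# Toy joint PMF: rows indexed by x, columns by y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# Sum out the other variable to get each marginal.
p_x = joint.sum(axis=1)  # p_X(x) = sum_y p(x, y) -> [0.40, 0.60]
p_y = joint.sum(axis=0)  # p_Y(y) = sum_x p(x, y) -> [0.35, 0.35, 0.30]
```

Each marginal is itself a valid distribution: non-negative and summing to 1.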
Conditional Distribution
The conditional distribution of $Y$ given $X = x$ is:
Discrete: $p_{Y \mid X}(y \mid x) = \dfrac{p_{X,Y}(x, y)}{p_X(x)}$ for $p_X(x) > 0$
Continuous: $f_{Y \mid X}(y \mid x) = \dfrac{f_{X,Y}(x, y)}{f_X(x)}$ for $f_X(x) > 0$
This defines a valid distribution over $Y$ for each fixed $x$.
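Conditioning on a discrete table amounts to slicing and renormalizing (a sketch with the same toy joint PMF as above):

```python
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])  # rows: x, columns: y

# Condition on X = 0: take that row of the joint and renormalize.
p_x0 = joint[0].sum()                       # p_X(0) = 0.40 (must be > 0)
p_y_given_x0 = joint[0] / p_x0              # -> [0.25, 0.50, 0.25]
assert np.isclose(p_y_given_x0.sum(), 1.0)  # a valid distribution over y
```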
Independence
Random variables $X$ and $Y$ are independent (written $X \perp Y$) if and only if:
$p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$ for all $x, y$
Equivalently, $p_{Y \mid X}(y \mid x) = p_Y(y)$ for all $x$ with $p_X(x) > 0$: knowing $X$ tells you nothing about $Y$.
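For a finite table, the factorization criterion can be checked numerically: the joint is independent exactly when it equals the outer product of its marginals (toy values, for illustration only):

```python
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

# X and Y are independent iff the joint equals the outer product
# of its marginals at every (x, y) pair.
independent = np.allclose(joint, np.outer(p_x, p_y))  # False: dependent

# By construction, the outer product itself always factorizes,
# so it is the "closest" independent joint with these marginals.
factored = np.outer(p_x, p_y)
```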
Conditional Independence
$X$ and $Y$ are conditionally independent given $Z$ (written $X \perp Y \mid Z$) if:
$p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$ for all $z$ with $p(z) > 0$
Conditional independence is the foundation of graphical models and the naive Bayes assumption. For modeling complex dependence structures beyond independence, copulas separate the marginal behavior from the joint dependence.
Main Theorems
Bayes Theorem
Statement
In density form:
$p(\theta \mid x) = \dfrac{p(x \mid \theta)\,p(\theta)}{p(x)}$
where $p(x) = \int p(x \mid \theta)\,p(\theta)\,d\theta$ is the marginal likelihood (the "evidence").
Intuition
Bayes theorem inverts a conditional: it converts $p(x \mid \theta)$ (the likelihood) into $p(\theta \mid x)$ (the posterior) by incorporating the prior $p(\theta)$. The denominator $p(x)$ normalizes the result.
Proof Sketch
Start from the definition of conditional probability: $p(\theta \mid x) = p(\theta, x)/p(x)$ and $p(x \mid \theta) = p(\theta, x)/p(\theta)$. From the second equation, $p(\theta, x) = p(x \mid \theta)\,p(\theta)$. Substituting into the first gives Bayes theorem.
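For a discrete parameter, the whole theorem is three lines of array arithmetic (the prior and likelihood values below are assumed toy numbers, not from any real model):

```python
import numpy as np

# Bayes theorem over a discrete parameter theta in {0, 1, 2}.
prior = np.array([0.5, 0.3, 0.2])       # p(theta), toy values
likelihood = np.array([0.1, 0.6, 0.9])  # p(x_obs | theta) for one observed x

evidence = (likelihood * prior).sum()       # p(x_obs), the normalizer
posterior = likelihood * prior / evidence   # p(theta | x_obs)

assert np.isclose(posterior.sum(), 1.0)     # a valid distribution
```

Note how the data shifts mass from the prior mode ($\theta = 0$) toward the values of $\theta$ under which the observation is likely.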
Why It Matters
Bayes theorem is the foundation of Bayesian inference. The prior $p(\theta)$ encodes beliefs before seeing data. The likelihood $p(x \mid \theta)$ connects data to parameters. The posterior $p(\theta \mid x)$ combines both. In ML: Bayesian neural networks, Gaussian processes, and MAP estimation all rest on this formula.
Failure Mode
The denominator $p(x) = \int p(x \mid \theta)\,p(\theta)\,d\theta$ (or the sum $\sum_\theta p(x \mid \theta)\,p(\theta)$ in the discrete case) is often intractable to compute because it requires integrating over all possible $\theta$. This is why approximate inference (MCMC, variational inference) is needed in practice.
Chain Rule and Law of Total Probability
Chain rule of probability for random variables:
$p(x_1, \dots, x_n) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \dots, x_{n-1})$
This is not an assumption; it follows directly from the definition of conditional probability applied repeatedly.
Law of total probability (discrete version):
$p(x) = \sum_y p(x \mid y)\,p(y)$
This "averages out" the conditioning variable. It is used to compute marginals from conditionals and appears throughout mixture model derivations. These operations connect directly to computing expectations and moments.
Common Confusions
Marginal independence does not imply conditional independence
$X$ and $Y$ can be marginally independent ($X \perp Y$) but conditionally dependent given $Z$ ($X \not\perp Y \mid Z$). The classic example: two independent causes $X$ and $Y$ of a common effect $Z$. Once you observe $Z$, learning about $X$ tells you about $Y$ (explaining away).
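A quick simulation makes explaining away concrete. Assume (as a toy model) two independent Bernoulli(1/2) causes and an effect that fires when either cause does:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.random(n) < 0.5   # cause 1, Bernoulli(1/2)
y = rng.random(n) < 0.5   # cause 2, independent of x
z = x | y                 # common effect: fires if either cause fires

# Conditioned on the effect alone, x is probable:
p_x_given_z1 = x[z].mean()         # P(x=1 | z=1) ~= 2/3
# But also learning that y fired "explains away" x:
p_x_given_z1_y1 = x[z & y].mean()  # P(x=1 | z=1, y=1) ~= 1/2
```

Marginally $X \perp Y$, yet observing $Z = 1$ makes them dependent: knowing $Y = 1$ lowers the probability that $X = 1$ from about $2/3$ back to $1/2$.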
Conditional independence does not imply marginal independence
The converse also fails: $X \perp Y \mid Z$ does not imply $X \perp Y$. Example: $Z \sim \mathrm{Bernoulli}(1/2)$, $X = Z$, $Y = Z$. Then $X \perp Y \mid Z$ (both are deterministic given $Z$), but $X$ and $Y$ are perfectly dependent marginally.
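The copy-of-$Z$ construction takes three lines to simulate:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=100_000)  # Z ~ Bernoulli(1/2)
x, y = z.copy(), z.copy()             # X = Z and Y = Z

# Given Z, both X and Y are constants, hence conditionally independent.
# Marginally, they are identical, hence perfectly correlated:
corr = np.corrcoef(x, y)[0, 1]        # exactly 1.0
```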
Conditioning on zero-probability events
For continuous random variables, $P(X = x) = 0$ for any specific $x$. The conditional density $f_{Y \mid X}(y \mid x)$ is defined as a ratio of densities, not as a ratio of event probabilities $P(Y = y, X = x)/P(X = x)$, which would be $0/0$. Rigorous treatment uses the Radon-Nikodym derivative or disintegration of measures.
Canonical Examples
Naive Bayes classifier
Naive Bayes assumes features are conditionally independent given the class: $p(x_1, \dots, x_d \mid y) = \prod_{j=1}^{d} p(x_j \mid y)$. By Bayes theorem, $p(y \mid x_1, \dots, x_d) \propto p(y) \prod_{j=1}^{d} p(x_j \mid y)$. This reduces the number of parameters from exponential in $d$ to linear in $d$. The independence assumption is almost always false, but the classifier often works well because the decision boundary can still be correct even when the probability estimates are wrong.
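A minimal Bernoulli naive Bayes sketch, assuming binary features and two classes; the prior and per-feature parameters below are made-up toy values, not fit to any data:

```python
import numpy as np

# Assumed toy parameters for a 2-class, 3-feature Bernoulli naive Bayes.
log_prior = np.log(np.array([0.6, 0.4]))   # log p(y)
theta = np.array([[0.8, 0.1, 0.5],         # p(x_j = 1 | y = 0)
                  [0.3, 0.7, 0.5]])        # p(x_j = 1 | y = 1)

def predict_log_posterior(x):
    """Unnormalized log p(y | x) = log p(y) + sum_j log p(x_j | y)."""
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return log_prior + log_lik

x = np.array([1, 0, 1])
scores = predict_log_posterior(x)
pred = scores.argmax()   # class with the highest posterior
```

Working in log space avoids underflow when the product runs over many features, and the normalizer $p(x)$ can be skipped entirely because argmax is unaffected by it.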
Exercises
Problem
A medical test has sensitivity $P(+ \mid D)$ and specificity $P(- \mid D^c)$; the disease prevalence is $P(D)$. Compute $P(D \mid +)$ in terms of these three quantities, using Bayes theorem and the law of total probability.
Problem
Let $(X, Y)$ have joint density $f_{X,Y}(x, y)$ on its support and $0$ otherwise. Find the marginal densities $f_X(x)$ and $f_Y(y)$, and the conditional density $f_{Y \mid X}(y \mid x)$.
References
Canonical:
- Grimmett & Stirzaker, Probability and Random Processes (2020), Chapters 1-3
- Casella & Berger, Statistical Inference (2002), Chapters 1-2
For ML context:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-2
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 2
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)