Decision Theory Foundations
The mathematical framework for rational choice. States, actions, consequences, Savage axioms, subjective probability, and the bridge between probability theory, utility theory, and statistical decision theory.
Why This Matters
Every statistical procedure is a decision. Choosing an estimator is choosing an action from a set of possible actions, where the quality of each action depends on an unknown state of nature. Decision theory makes this structure explicit and asks: what does it mean to choose well?
The framework unifies Bayesian and frequentist statistics under a common language. Bayesian inference minimizes posterior expected loss. Frequentist theory studies risk functions and admissibility. Both are special cases of the same decision-theoretic formalism introduced by Wald and axiomatized by Savage.
If you train ML models, you already make decision-theoretic commitments every time you choose a loss function. Understanding why squared error, cross-entropy, and 0-1 loss behave differently requires this framework.
The Decision Problem
Statistical Decision Problem
A statistical decision problem is a triple $(\Theta, \mathcal{A}, L)$ where:
- $\Theta$ is the parameter space (states of nature). The true state $\theta \in \Theta$ is unknown.
- $\mathcal{A}$ is the action space. The decision-maker selects an action $a \in \mathcal{A}$.
- $L: \Theta \times \mathcal{A} \to [0, \infty)$ is the loss function. $L(\theta, a)$ is the cost of taking action $a$ when the true state is $\theta$.
A decision rule (or estimator) is a function $\delta: \mathcal{X} \to \mathcal{A}$ mapping observed data $x \in \mathcal{X}$ to actions.
Risk Function
The risk function of a decision rule $\delta$ is the expected loss under the true parameter:
$$R(\theta, \delta) = \mathbb{E}_\theta\big[L(\theta, \delta(X))\big] = \int L(\theta, \delta(x)) \, p(x \mid \theta) \, dx$$
The risk function measures how well $\delta$ performs, as a function of the unknown $\theta$. No single number summarizes performance because $\theta$ is unknown.
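A short simulation makes the "risk is a function of $\theta$" point concrete. The sketch below (estimator choices are illustrative, not from the source) approximates the risk curves of two estimators of a Bernoulli success probability by Monte Carlo; neither dominates the other, which is why no single number summarizes performance.

```python
# Monte Carlo approximation of the risk R(theta, delta) = E_theta[(theta - delta(X))^2]
# for two estimators of a Bernoulli parameter from n trials: the MLE X/n and an
# illustrative shrinkage estimator (the posterior mean under a Beta(2, 2) prior).
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 20, 20_000

def risk(delta, theta):
    """Approximate the frequentist risk of rule delta at parameter theta."""
    x = rng.binomial(n, theta, size=n_sims)  # X = number of successes
    return np.mean((theta - delta(x)) ** 2)

mle = lambda x: x / n
shrink = lambda x: (x + 2) / (n + 4)  # pulls the estimate toward 1/2

for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"theta={theta:.1f}  R(mle)={risk(mle, theta):.5f}  "
          f"R(shrink)={risk(shrink, theta):.5f}")
```

The shrinkage rule has lower risk near $\theta = 1/2$, the MLE has lower risk near the endpoints: each rule's risk curve crosses the other's, so comparing them requires a criterion beyond the risk function itself.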
Bayes Risk
Given a prior distribution $\pi$ on $\Theta$, the Bayes risk of a decision rule $\delta$ is:
$$r(\pi, \delta) = \int_\Theta R(\theta, \delta) \, \pi(\theta) \, d\theta$$
The Bayes rule (or Bayes estimator) with respect to $\pi$ is the decision rule $\delta_\pi$ that minimizes the Bayes risk. Equivalently, $\delta_\pi$ minimizes the posterior expected loss for each observed $x$:
$$\delta_\pi(x) = \arg\min_{a \in \mathcal{A}} \mathbb{E}\big[L(\theta, a) \mid X = x\big]$$
Standard Loss Functions and Their Bayes Rules
The choice of loss function determines the optimal action:
- Squared error loss $L(\theta, a) = (\theta - a)^2$: Bayes rule is the posterior mean $\mathbb{E}[\theta \mid x]$.
- Absolute error loss $L(\theta, a) = |\theta - a|$: Bayes rule is the posterior median.
- 0-1 loss $L(\theta, a) = \mathbf{1}\{a \neq \theta\}$: Bayes rule is the posterior mode (MAP estimate).
These are not interchangeable. For a skewed posterior, the mean, median, and mode are all different, so the "best" estimator depends on how you define "best." This is not a technical nuisance; it is the central point of decision theory.
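The mean/median/mode separation can be verified numerically. The sketch below (the Gamma posterior and grid resolution are illustrative choices, not from the source) minimizes the discretized posterior expected loss over a grid of actions for each loss function; a narrow-window loss stands in for 0-1 loss on a continuous parameter.

```python
# For a skewed posterior, here Gamma(shape=2, scale=1), minimize posterior
# expected loss over a grid of candidate actions. Each loss function selects
# a different posterior summary: mean, median, or mode.
import numpy as np

theta = np.linspace(0.001, 15, 20_000)   # parameter grid
post = theta * np.exp(-theta)            # Gamma(2, 1) density, unnormalized
post /= post.sum()                       # discretized posterior weights

candidates = theta[::20]                 # coarser grid of candidate actions

def bayes_action(loss):
    """Action minimizing sum_theta loss(theta, a) * post(theta)."""
    exp_loss = [np.sum(loss(theta, a) * post) for a in candidates]
    return candidates[np.argmin(exp_loss)]

sq = lambda t, a: (t - a) ** 2              # squared error -> posterior mean
ab = lambda t, a: np.abs(t - a)             # absolute error -> posterior median
zo = lambda t, a: (np.abs(t - a) > 0.05)    # ~0-1 loss -> posterior mode

print("squared :", bayes_action(sq))   # near 2.0, the mean of Gamma(2, 1)
print("absolute:", bayes_action(ab))   # near 1.68, the median
print("0-1     :", bayes_action(zo))   # near 1.0, the mode
```

All three numbers come from the same posterior; only the loss function changed.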
Savage's Axioms
Savage (1954) showed that if a decision-maker's preferences satisfy certain rationality axioms, then the decision-maker behaves as if maximizing expected utility with respect to a subjective probability distribution. The axioms do not assume probability exists. They derive it from preferences.
Savage Framework
The primitives are:
- $S$: a set of states of the world (what might be true)
- $C$: a set of consequences (outcomes the agent cares about)
- $F$: the set of acts (functions $f: S \to C$)
- $\succeq$: a preference relation on acts ($f \succeq g$ means "act $f$ is at least as good as act $g$")
Savage's seven axioms (P1 through P7) constrain $\succeq$ to be "rational."
The key axioms are:
- P1 (Completeness and transitivity): $\succeq$ is a total preorder on $F$.
- P2 (Sure-thing principle): If $f$ and $g$ agree outside an event $E$, then the preference between $f$ and $g$ depends only on their values on $E$. This is the independence axiom for decisions under uncertainty.
- P3 (State-independent preferences): Preferences over consequences do not depend on which state obtains.
- P4 (Comparative probability): Events can be consistently ranked by "more likely than."
Savage Representation Theorem
Statement
If the preference relation $\succeq$ on acts satisfies Savage's axioms P1 through P7, then there exist a unique probability measure $P$ on states and a utility function $u: C \to \mathbb{R}$ (unique up to positive affine transformation) such that for all acts $f, g$:
$$f \succeq g \iff \int_S u(f(s)) \, dP(s) \ge \int_S u(g(s)) \, dP(s)$$
The agent acts as if maximizing subjective expected utility.
Intuition
Rationality axioms force you to behave as if you have beliefs (a probability measure) and desires (a utility function), and you combine them by computing expected utility. The probability is not assumed; it is derived from your pattern of choices. If you violate the axioms, there exist scenarios where you can be exploited by a series of bets (a Dutch book).
Proof Sketch
The proof proceeds in stages. First, use P1 through P4 to construct a qualitative probability from the preference relation: event $A$ is "more likely" than event $B$ if the agent prefers betting on $A$ to betting on $B$ for any favorable consequence. Then use P6 and P7 (which guarantee arbitrarily fine partitions of the state space) to extend this qualitative ordering to a finitely additive probability measure. Finally, use the probability measure and the preference relation to construct the utility function via the Debreu representation theorem on mixture spaces.
Why It Matters
This theorem provides the foundation for Bayesian statistics and decision theory. It says: if you want to be rational (in the sense of the axioms), you must have a prior, and you must make decisions by expected utility maximization. Every violation of Bayesian decision theory corresponds to a violation of one of the axioms.
Failure Mode
The sure-thing principle (P2) is violated empirically by the Allais paradox and Ellsberg paradox. Prospect theory replaces the Savage framework with a descriptive model that accommodates these violations. The axioms also require a rich state space (no atoms too large), which fails in some simple decision problems.
Frequentist Decision Theory
Frequentist decision theory does not place a prior on $\Theta$. Instead, it evaluates decision rules by their risk function $R(\theta, \delta)$ across all possible $\theta$.
Admissibility
A decision rule $\delta$ is admissible if no other rule dominates it: there is no $\delta'$ with $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$, with strict inequality for some $\theta$.
An inadmissible rule is dominated by some other rule (weakly worse everywhere, strictly worse somewhere) and should never be used.
Minimax Decision Rule
A decision rule $\delta^*$ is minimax if it minimizes the worst-case risk:
$$\delta^* = \arg\min_{\delta} \, \sup_{\theta \in \Theta} R(\theta, \delta)$$
Minimax rules are conservative. They protect against the worst case.
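The classic binomial example makes the minimax criterion concrete: under squared error, the rule $(X + \sqrt{n}/2)/(n + \sqrt{n})$ is Bayes for a $\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ prior and has constant risk, which makes it minimax. The sketch below computes both risk curves in closed form (the choice $n = 25$ is illustrative).

```python
# Exact risk curves under squared error for two estimators of a
# Binomial(n, theta) success probability: the MLE X/n versus the minimax
# rule (X + sqrt(n)/2) / (n + sqrt(n)).
import numpy as np

n = 25
theta = np.linspace(0, 1, 201)

def mse(a, b):
    """Risk of the linear rule delta(X) = (X + a) / (n + b):
    variance n*theta*(1-theta)/(n+b)^2 plus squared bias ((a - b*theta)/(n+b))^2."""
    return (n * theta * (1 - theta) + (a - b * theta) ** 2) / (n + b) ** 2

risk_mle = mse(0.0, 0.0)                        # theta * (1 - theta) / n
risk_minimax = mse(np.sqrt(n) / 2, np.sqrt(n))  # constant: 1 / (4 * (sqrt(n) + 1)^2)

print("worst-case risk, MLE    :", risk_mle.max())
print("worst-case risk, minimax:", risk_minimax.max())
print("minimax risk is flat    :", bool(np.allclose(risk_minimax, risk_minimax[0])))
```

The minimax rule has strictly lower worst-case risk, but the MLE beats it near $\theta = 0$ and $\theta = 1$: conservatism buys protection at the center of the parameter space at a price near the edges.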
Wald Complete Class Theorem (Sketch)
Statement
Under regularity conditions, every admissible decision rule is a Bayes rule (or a limit of Bayes rules). Equivalently, the class of Bayes rules forms an essentially complete class: for every non-Bayes rule, there exists a Bayes rule that is at least as good everywhere.
Intuition
If your estimator is not Bayes for any prior, then someone can find a Bayes estimator that dominates it. This is a deep connection: even if you refuse to be Bayesian, the admissible procedures you would use are exactly the Bayesian ones (or their limits with improper priors).
Proof Sketch
The proof uses a supporting hyperplane argument. The set of achievable risk vectors is convex (by randomization). An admissible rule corresponds to a point on the lower boundary of this set. By the supporting hyperplane theorem, every such point minimizes some nonnegative linear functional of the risk vector; normalizing the weights of that functional yields a prior distribution on $\Theta$, so the rule is Bayes with respect to that prior.
Why It Matters
The complete class theorem shows that Bayesian and frequentist approaches are not as different as textbooks suggest. Admissible frequentist procedures are Bayes procedures. The difference is in how you select which Bayes procedure to use: the Bayesian picks a prior by belief; the frequentist picks it by minimax or other criteria.
Failure Mode
The theorem requires regularity conditions such as compactness of $\Theta$. For non-compact parameter spaces, there may be admissible rules that are not Bayes or limits of Bayes. The James-Stein estimator shows that the MLE for a multivariate normal mean in dimension $p \ge 3$ is inadmissible, but the dominating estimator is a non-obvious shrinkage estimator.
Connections to ML
Loss function choice in ML is a decision-theoretic commitment:
- Cross-entropy loss corresponds to a decision problem where the action is a predicted probability distribution and the loss penalizes KL divergence from the truth. The Bayes-optimal classifier under 0-1 loss is the MAP rule.
- Squared error (MSE) in regression treats all errors symmetrically. If asymmetric costs matter (e.g., overestimating drug dosage is worse than underestimating), the loss function should reflect this.
- Regularization in ML corresponds to a prior in the decision-theoretic framework. L2 regularization is a Gaussian prior; L1 is a Laplace prior. The complete class theorem explains why regularized estimators often dominate unregularized ones.
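The regularization-as-prior correspondence can be checked directly. The sketch below (all numbers are synthetic) shows that for a Gaussian linear model with noise variance $\sigma^2$ and prior $w \sim \mathcal{N}(0, \tau^2 I)$, the MAP estimate coincides with ridge regression at $\lambda = \sigma^2/\tau^2$.

```python
# Ridge regression as MAP estimation: for y = X w + noise with noise variance
# sigma2 and Gaussian prior w ~ N(0, tau2 * I), maximizing the log posterior
#   -||y - X w||^2 / (2 sigma2) - ||w||^2 / (2 tau2)
# gives the ridge normal equations with lambda = sigma2 / tau2.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 0.25, 1.0
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge solution: (X'X + lambda I) w = X'y
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# MAP solution: set the gradient of the log posterior to zero
map_est = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / tau2, X.T @ y / sigma2)

print(np.allclose(ridge, map_est))  # True: the two solutions coincide
```

Multiplying the MAP normal equations through by $\sigma^2$ recovers the ridge equations exactly, which is why the two solves agree up to floating point.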
Decision Theory vs. Game Theory
Decision theory is a single-agent framework: one decision-maker facing an unknown state of nature. Nature does not play strategically. Game theory extends this to multiple strategic agents, each with their own objectives.
Minimax decision theory is the bridge. In the minimax formulation, nature is treated as an adversary choosing to maximize the agent's loss. This is a zero-sum game between the decision-maker and nature.
Common Confusions
Loss function is not the same as the objective function
The loss function $L(\theta, a)$ measures the cost of action $a$ when the true state is $\theta$. The risk $R(\theta, \delta)$ averages the loss over the data distribution. The Bayes risk $r(\pi, \delta)$ averages the risk over the prior. These are three different quantities. Minimizing the loss for a known $\theta$ is trivial (just set $a = \theta$); the problem is that $\theta$ is unknown.
Admissibility is necessary but not sufficient
Being admissible means no other rule uniformly dominates you. But there can be many admissible rules, and some may be terrible in practice. Admissibility is a minimal requirement, not a recommendation. You still need additional criteria (minimax, Bayes optimality, simplicity) to select among admissible rules.
Minimax is not always conservative in a bad way
People dismiss minimax as overly conservative. But minimax rules are often Bayes rules with respect to a least-favorable prior. In some problems, the minimax rule is also the Bayes rule for a natural prior, so the pessimism is justified. For a normal mean under squared error loss, the sample mean is minimax and is the generalized Bayes estimator under the flat (improper) prior.
Exercises
Problem
A weather forecaster must predict whether it will rain (action $a = 1$) or not ($a = 0$). The true state is $\theta \in \{0, 1\}$. The loss function is $L(\theta, a) = 0$ when $a = \theta$, $L(1, 0) = c$, and $L(0, 1) = 1$, where $c$ is the cost ratio of a missed rain warning to a false alarm. Given a posterior probability $p = P(\theta = 1 \mid \text{data})$, what is the Bayes-optimal decision rule?
Problem
Show that under squared error loss $L(\theta, a) = (\theta - a)^2$, the Bayes estimator with respect to prior $\pi$ is the posterior mean $\mathbb{E}[\theta \mid x]$.
Problem
The James-Stein phenomenon: Let $X \sim \mathcal{N}_p(\theta, I_p)$ with $p \ge 3$. The MLE is $\hat{\theta}_{\text{MLE}} = X$. Show that the James-Stein estimator $\hat{\theta}_{\text{JS}} = \left(1 - \frac{p - 2}{\|X\|^2}\right) X$ has strictly smaller risk than the MLE under squared error loss for all $\theta$. What does this imply about the admissibility of the MLE?
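A simulation does not replace the proof the exercise asks for, but it shows the phenomenon. The sketch below (dimension and test points are illustrative) estimates the risk of the MLE and the James-Stein estimator by Monte Carlo at several values of $\|\theta\|$.

```python
# Monte Carlo illustration of the James-Stein phenomenon: for X ~ N_p(theta, I)
# with p >= 3, shrinking X toward the origin reduces total squared error.
# The risk of the MLE is exactly p; the James-Stein risk stays below it.
import numpy as np

rng = np.random.default_rng(2)
p, n_sims = 10, 50_000

def risks(theta):
    """Estimate the risks of the MLE and the James-Stein rule at theta."""
    x = rng.normal(loc=theta, scale=1.0, size=(n_sims, p))
    norm2 = np.sum(x ** 2, axis=1, keepdims=True)
    js = (1 - (p - 2) / norm2) * x                      # James-Stein shrinkage
    r_mle = np.mean(np.sum((x - theta) ** 2, axis=1))   # should be close to p
    r_js = np.mean(np.sum((js - theta) ** 2, axis=1))
    return r_mle, r_js

for scale in [0.0, 1.0, 3.0]:
    theta = scale * np.ones(p)
    r_mle, r_js = risks(theta)
    print(f"||theta||={np.linalg.norm(theta):5.2f}  "
          f"R(MLE)={r_mle:.3f}  R(JS)={r_js:.3f}")
```

The gain is largest near the origin and shrinks as $\|\theta\|$ grows, but the James-Stein risk never exceeds the MLE's, which is exactly the inadmissibility claim.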
Problem
Consider the connection between decision theory and ML. In neural network training with cross-entropy loss, the network learns an approximation to the Bayes-optimal classifier. But the Bayes-optimal classifier assumes the true data distribution is known. In practice, we only have a finite training set. Describe the gap between the decision-theoretic Bayes risk and the empirical risk minimizer in terms of approximation error, estimation error, and optimization error. Which of these does decision theory address, and which does it ignore?
References
Canonical:
- Berger, Statistical Decision Theory and Bayesian Analysis (1985), Chapters 1-2, 4-5
- Savage, The Foundations of Statistics (1954), Chapters 2-5
Textbook treatments:
- Lehmann and Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1, 4-5
- Robert, The Bayesian Choice (2nd ed., 2007), Chapters 2-4
- Ferguson, Mathematical Statistics: A Decision Theoretic Approach (1967), Chapters 1-2
Modern perspective:
- Gelman et al., Bayesian Data Analysis (3rd ed., 2013), Chapter 2
Next Topics
The natural next steps from decision theory foundations:
- Expected utility: the von Neumann-Morgenstern axioms and utility representation for lotteries
- Game theory: extending decision theory to strategic interaction with multiple agents
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Bayesian Estimation (Layer 0B)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in R^n (Layer 0A)
Builds on This
- Bounded Rationality (Layer 2)
- Leverage Points in Complex Systems (Layer 3)