
Decision Theory

Decision Theory Foundations

The mathematical framework for rational choice. States, actions, consequences, Savage axioms, subjective probability, and the bridge between probability theory, utility theory, and statistical decision theory.


Why This Matters

Every statistical procedure is a decision. Choosing an estimator is choosing an action from a set of possible actions, where the quality of each action depends on an unknown state of nature. Decision theory makes this structure explicit and asks: what does it mean to choose well?

The framework unifies Bayesian and frequentist statistics under a common language. Bayesian inference minimizes posterior expected loss. Frequentist theory studies risk functions and admissibility. Both are special cases of the same decision-theoretic formalism introduced by Wald and axiomatized by Savage.

If you train ML models, you already make decision-theoretic commitments every time you choose a loss function. Understanding why squared error, cross-entropy, and 0-1 loss behave differently requires this framework.

The Decision Problem

Definition

Statistical Decision Problem

A statistical decision problem is a triple (\Theta, \mathcal{A}, L) where:

  • \Theta is the parameter space (states of nature). The true state \theta \in \Theta is unknown.
  • \mathcal{A} is the action space. The decision-maker selects an action a \in \mathcal{A}.
  • L: \Theta \times \mathcal{A} \to \mathbb{R} is the loss function. L(\theta, a) is the cost of taking action a when the true state is \theta.

A decision rule (or estimator) is a function \delta: \mathcal{X} \to \mathcal{A} mapping observed data to actions.

Definition

Risk Function

The risk function of a decision rule \delta is the expected loss under the true parameter:

R(\theta, \delta) = E_\theta[L(\theta, \delta(X))] = \int L(\theta, \delta(x)) \, p(x|\theta) \, dx

The risk function measures how well \delta performs, as a function of the unknown \theta. No single number summarizes performance because \theta is unknown.
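As a concrete illustration (not from the text), the risks of two rules for estimating a normal mean have closed forms, and neither rule dominates the other; this is exactly why the risk must be studied as a function of \theta rather than summarized by one number. The shrinkage factor 0.5 below is an arbitrary choice for the sketch.

```python
import numpy as np

# Estimating theta from n iid N(theta, 1) observations under squared error loss.
# Rule 1: the sample mean, delta_1(x) = xbar.     Risk: R(theta, delta_1) = 1/n.
# Rule 2: a shrinkage rule, delta_2(x) = 0.5*xbar. Risk (bias-variance):
#         R(theta, delta_2) = 0.25/n + 0.25 * theta**2.
n = 10
thetas = np.linspace(-2, 2, 101)
risk_mean = np.full_like(thetas, 1 / n)          # constant in theta
risk_shrunk = 0.25 / n + 0.25 * thetas**2        # small near 0, grows with |theta|

# Neither rule dominates: shrinkage wins near theta = 0, loses far away.
print(risk_shrunk[50] < risk_mean[50])   # True (at theta = 0)
print(risk_shrunk[0] > risk_mean[0])     # True (at theta = -2)
```

Comparing the two risk curves pointwise is the frequentist picture; integrating them against a prior collapses each curve to a single Bayes-risk number, which is the Bayesian picture introduced next.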

Definition

Bayes Risk

Given a prior distribution \pi on \Theta, the Bayes risk of a decision rule \delta is:

r(\pi, \delta) = \int R(\theta, \delta) \, d\pi(\theta) = E_\pi[R(\theta, \delta)]

The Bayes rule (or Bayes estimator) with respect to \pi is the decision rule \delta^* that minimizes the Bayes risk. Equivalently, \delta^* minimizes the posterior expected loss for each observed x:

\delta^*(x) = \arg\min_a \int L(\theta, a) \, p(\theta|x) \, d\theta

Standard Loss Functions and Their Bayes Rules

The choice of loss function determines the optimal action:

  • Squared error loss L(\theta, a) = (\theta - a)^2: Bayes rule is the posterior mean E[\theta|x].
  • Absolute error loss L(\theta, a) = |\theta - a|: Bayes rule is the posterior median.
  • 0-1 loss L(\theta, a) = \mathbf{1}(\theta \neq a): Bayes rule is the posterior mode (MAP estimate).

These are not interchangeable. For a skewed posterior, the mean, median, and mode are all different, so the "best" estimator depends on how you define "best." This is not a technical nuisance; it is the central point of decision theory.
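A quick numerical check makes the divergence concrete. The posterior below is a hypothetical Gamma(2, 1), chosen only because it is right-skewed; its mean, median, and mode are all different, so the three loss functions prescribe three different actions.

```python
import numpy as np

# Hypothetical skewed posterior: theta | x ~ Gamma(shape=2, scale=1),
# with mean 2, mode 1, and median about 1.68 (no closed form).
rng = np.random.default_rng(0)
draws = rng.gamma(shape=2.0, scale=1.0, size=200_000)  # posterior draws

post_mean = draws.mean()        # Bayes rule under squared error loss
post_median = np.median(draws)  # Bayes rule under absolute error loss
post_mode = 1.0                 # analytic mode = shape - 1 (MAP, 0-1 loss)

# Right skew strictly orders the three estimators: mode < median < mean.
print(post_mode < post_median < post_mean)  # True
```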

Savage's Axioms

Savage (1954) showed that if a decision-maker's preferences satisfy certain rationality axioms, then the decision-maker behaves as if maximizing expected utility with respect to a subjective probability distribution. The axioms do not assume probability exists. They derive it from preferences.

Definition

Savage Framework

The primitives are:

  • S: a set of states of the world (what might be true)
  • C: a set of consequences (outcomes the agent cares about)
  • F: the set of acts (functions f: S \to C)
  • \succsim: a preference relation on acts (f \succsim g means "act f is at least as good as act g")

Savage's seven axioms (P1 through P7) constrain \succsim to be "rational."

The key axioms are:

  1. P1 (Completeness and transitivity): \succsim is a total preorder on F.
  2. P2 (Sure-thing principle): If f and g agree on event A^c, then the preference between f and g depends only on their values on A. This is the independence axiom for decisions under uncertainty.
  3. P3 (State-independent preferences): Preferences over consequences do not depend on which state obtains.
  4. P4 (Comparative probability): Events can be consistently ranked by "more likely than."

Theorem

Savage Representation Theorem

Statement

If the preference relation \succsim on acts satisfies Savage's axioms P1 through P7, then there exist a unique probability measure P on states and a utility function u: C \to \mathbb{R} (unique up to positive affine transformation) such that for all acts f, g:

f \succsim g \iff \int_S u(f(s)) \, dP(s) \geq \int_S u(g(s)) \, dP(s)

The agent acts as if maximizing subjective expected utility.

Intuition

Rationality axioms force you to behave as if you have beliefs (a probability measure) and desires (a utility function), and you combine them by computing expected utility. The probability is not assumed; it is derived from your pattern of choices. If you violate the axioms, there exist scenarios where you can be exploited by a series of bets (a Dutch book).

Proof Sketch

The proof proceeds in stages. First, use P1 through P4 to construct a qualitative probability from the preference relation: event A is "more likely" than B if the agent prefers betting on A to betting on B for any favorable consequence. Then use P6 and P7 (fine-grained partitions) to extend this to a finitely additive probability measure. Finally, use the probability measure and the preference relation to construct the utility function via the Debreu representation theorem on mixture spaces.

Why It Matters

This theorem provides the foundation for Bayesian statistics and decision theory. It says: if you want to be rational (in the sense of the axioms), you must have a prior, and you must make decisions by expected utility maximization. Every violation of Bayesian decision theory corresponds to a violation of one of the axioms.

Failure Mode

The sure-thing principle (P2) is violated empirically by the Allais paradox and Ellsberg paradox. Prospect theory replaces the Savage framework with a descriptive model that accommodates these violations. The axioms also require a rich state space (no atoms too large), which fails in some simple decision problems.

Frequentist Decision Theory

Frequentist decision theory does not place a prior on \Theta. Instead, it evaluates decision rules by their risk function R(\theta, \delta) across all possible \theta.

Definition

Admissibility

A decision rule \delta is admissible if no other rule \delta' dominates it:

\nexists \; \delta' \text{ such that } R(\theta, \delta') \leq R(\theta, \delta) \; \forall \theta, \text{ with strict inequality for some } \theta

An inadmissible rule is dominated: some other rule does at least as well for every \theta and strictly better for some \theta, so the inadmissible rule should never be used.

Definition

Minimax Decision Rule

A decision rule \delta^* is minimax if it minimizes the worst-case risk:

\delta^* = \arg\min_\delta \sup_\theta R(\theta, \delta)

Minimax rules are conservative. They protect against the worst case.
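For a finite problem the sup and argmin are just array operations. A toy risk matrix (illustrative numbers only, not from any specific model) makes the conservatism visible: the minimax rule is the mediocre-everywhere one, not either specialist.

```python
import numpy as np

# Minimax selection for a finite problem: rows = decision rules,
# columns = states theta. Entries are risks R(theta, delta).
risk = np.array([
    [0.10, 0.90],   # rule 0: great in state 0, terrible in state 1
    [0.40, 0.45],   # rule 1: mediocre in both states
    [0.80, 0.20],   # rule 2: mirror image of rule 0
])

worst_case = risk.max(axis=1)           # sup over theta, for each rule
minimax_rule = int(worst_case.argmin()) # minimize the worst case

print(minimax_rule)  # 1: smallest worst-case risk (0.45)
```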

Theorem

Wald Complete Class Theorem (Sketch)

Statement

Under regularity conditions, every admissible decision rule is a Bayes rule (or a limit of Bayes rules). Equivalently, the class of Bayes rules forms an essentially complete class: for every non-Bayes rule, there exists a Bayes rule that is at least as good everywhere.

Intuition

If your estimator is not Bayes for any prior, then someone can find a Bayes estimator that dominates it. This is a deep connection: even if you refuse to be Bayesian, the admissible procedures you would use are exactly the Bayesian ones (or their limits with improper priors).

Proof Sketch

The proof uses a supporting hyperplane argument. The set of achievable risk vectors \{R(\cdot, \delta) : \delta \in \mathcal{D}\} is convex (by randomization). An admissible rule corresponds to a point on the lower boundary of this set. By the supporting hyperplane theorem, every such point minimizes some linear functional over the set, and that functional corresponds to a prior distribution weighting the risk across \Theta.

Why It Matters

The complete class theorem shows that Bayesian and frequentist approaches are not as different as textbooks suggest. Admissible frequentist procedures are Bayes procedures. The difference is in how you select which Bayes procedure to use: the Bayesian picks a prior by belief; the frequentist picks it by minimax or other criteria.

Failure Mode

The theorem requires compactness of \Theta. For non-compact parameter spaces, there may be admissible rules that are not Bayes or limits of Bayes rules. The James-Stein estimator shows that the MLE for a multivariate normal mean in dimension p \geq 3 is inadmissible, but the dominating estimator is a non-obvious shrinkage estimator.
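The James-Stein domination is easy to verify by simulation. The sketch below fixes one arbitrary \theta and p = 10 (any \theta and p \geq 3 behave the same way) and compares Monte Carlo estimates of the two risks under squared error loss.

```python
import numpy as np

# Monte Carlo check of the James-Stein phenomenon for p = 10.
# X ~ N(theta, I_p); the MLE is X itself; JS shrinks X toward the origin.
rng = np.random.default_rng(1)
p, reps = 10, 20_000
theta = np.ones(p)  # one arbitrary nonzero truth

X = theta + rng.standard_normal((reps, p))
norms2 = (X**2).sum(axis=1)
js = (1 - (p - 2) / norms2)[:, None] * X   # James-Stein estimator

risk_mle = ((X - theta)**2).sum(axis=1).mean()  # exact value is p = 10
risk_js = ((js - theta)**2).sum(axis=1).mean()  # strictly smaller

print(risk_js < risk_mle)  # True: the MLE is inadmissible for p >= 3
```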

Connections to ML

Loss function choice in ML is a decision-theoretic commitment:

  • Cross-entropy loss corresponds to a decision problem where the action is a predicted probability distribution and the loss penalizes KL divergence from the truth. The Bayes-optimal classifier under 0-1 loss is the MAP rule.
  • Squared error (MSE) in regression treats all errors symmetrically. If asymmetric costs matter (e.g., overestimating drug dosage is worse than underestimating), the loss function should reflect this.
  • Regularization in ML corresponds to a prior in the decision-theoretic framework. L2 regularization is a Gaussian prior; L1 is a Laplace prior. The complete class theorem explains why regularized estimators often dominate unregularized ones.
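The regularization-as-prior correspondence in the last bullet can be checked directly: solving the L2-penalized least-squares problem (here via an augmented design matrix) reproduces the closed-form posterior mode under a Gaussian prior. The data and \lambda below are arbitrary, and unit noise variance is assumed.

```python
import numpy as np

# Ridge regression as MAP estimation: minimizing ||y - X w||^2 + lam ||w||^2
# is the posterior mode under w ~ N(0, (1/lam) I) with N(0, 1) noise.
rng = np.random.default_rng(2)
n, d, lam = 50, 3, 2.0
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Penalized least squares, solved by stacking sqrt(lam)*I under X
# (the "regularizer" view: an ordinary least-squares problem).
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
w_ridge, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Gaussian-prior MAP from the posterior normal equations (the "prior" view).
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_ridge, w_map))  # True: the two views give one estimator
```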

Decision Theory vs. Game Theory

Decision theory is a single-agent framework: one decision-maker facing an unknown state of nature. Nature does not play strategically. Game theory extends this to multiple strategic agents, each with their own objectives.

Minimax decision theory is the bridge. In the minimax formulation, nature is treated as an adversary choosing \theta to maximize the agent's loss. This is a zero-sum game between the decision-maker and nature.

Common Confusions

Watch Out

Loss function is not the same as the objective function

The loss function L(\theta, a) measures the cost of action a when the true state is \theta. The risk R(\theta, \delta) averages the loss over the data distribution. The Bayes risk r(\pi, \delta) averages the risk over the prior. These are three different quantities. Minimizing loss for a given \theta is trivial (just set a = \theta). The problem is that \theta is unknown.

Watch Out

Admissibility is necessary but not sufficient

Being admissible means no other rule uniformly dominates you. But there can be many admissible rules, and some may be terrible in practice. Admissibility is a minimal requirement, not a recommendation. You still need additional criteria (minimax, Bayes optimality, simplicity) to select among admissible rules.

Watch Out

Minimax is not always conservative in a bad way

People dismiss minimax as overly conservative. But minimax rules are often Bayes rules with respect to a least-favorable prior. In some problems, the minimax rule coincides with the Bayes rule for a natural prior, so the pessimism costs nothing. For a normal mean under squared error loss, the sample mean is minimax and arises as a limit of Bayes estimators.

Exercises

ExerciseCore

Problem

A weather forecaster must predict whether it will rain (action a = 1) or not (a = 0). The true state is \theta \in \{0, 1\}. The loss function is L(\theta, a) = c \cdot \mathbf{1}(\theta = 1, a = 0) + \mathbf{1}(\theta = 0, a = 1), where c > 1 is the cost ratio of a missed rain warning to a false alarm. Given a posterior probability p = P(\theta = 1 | x), what is the Bayes-optimal decision rule?

ExerciseCore

Problem

Show that under squared error loss L(\theta, a) = (\theta - a)^2, the Bayes estimator with respect to prior \pi(\theta) is the posterior mean E[\theta | x].

ExerciseAdvanced

Problem

The James-Stein phenomenon: Let X \sim \mathcal{N}(\theta, I_p) with \theta \in \mathbb{R}^p and p \geq 3. The MLE is \hat{\theta} = X. Show that the James-Stein estimator \hat{\theta}_{\text{JS}} = (1 - (p-2)/\|X\|^2) X has strictly smaller risk than the MLE under squared error loss for all \theta. What does this imply about the admissibility of the MLE?

ExerciseResearch

Problem

Consider the connection between decision theory and ML. In neural network training with cross-entropy loss, the network learns an approximation to the Bayes-optimal classifier. But the Bayes-optimal classifier assumes the true data distribution is known. In practice, we only have a finite training set. Describe the gap between the decision-theoretic Bayes risk and the empirical risk minimizer in terms of approximation error, estimation error, and optimization error. Which of these does decision theory address, and which does it ignore?

References

Canonical:

  • Berger, Statistical Decision Theory and Bayesian Analysis (1985), Chapters 1-2, 4-5
  • Savage, The Foundations of Statistics (1954), Chapters 2-5

Textbook treatments:

  • Lehmann and Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1, 4-5
  • Robert, The Bayesian Choice (2nd ed., 2007), Chapters 2-4
  • Ferguson, Mathematical Statistics: A Decision Theoretic Approach (1967), Chapters 1-2

Modern perspective:

  • Gelman et al., Bayesian Data Analysis (3rd ed., 2013), Chapter 2

Next Topics

The natural next steps from decision theory foundations:

  • Expected utility: the von Neumann-Morgenstern axioms and utility representation for lotteries
  • Game theory: extending decision theory to strategic interaction with multiple agents

Last reviewed: April 2026
