
Decision Theory

Decision Theory Foundations

The mathematical framework for rational choice. States, actions, consequences, Savage axioms, subjective probability, and the bridge between probability theory, utility theory, and statistical decision theory.


Why This Matters

Every statistical procedure is a decision. Choosing an estimator is choosing an action from a set of possible actions, where the quality of each action depends on an unknown state of nature. Decision theory makes this structure explicit and asks: what does it mean to choose well?

The framework unifies Bayesian and frequentist statistics under a common language. Bayesian inference minimizes posterior expected loss. Frequentist theory studies risk functions and admissibility. Both are special cases of the same decision-theoretic formalism introduced by Wald and axiomatized by Savage.

If you train ML models, you already make decision-theoretic commitments every time you choose a loss function. Understanding why squared error, cross-entropy, and 0-1 loss behave differently requires this framework.

The Decision Problem

Definition

Statistical Decision Problem

A statistical decision problem is a triple (\Theta, \mathcal{A}, L) where:

  • \Theta is the parameter space (states of nature). The true state \theta \in \Theta is unknown.
  • \mathcal{A} is the action space. The decision-maker selects an action a \in \mathcal{A}.
  • L: \Theta \times \mathcal{A} \to \mathbb{R} is the loss function. L(\theta, a) is the cost of taking action a when the true state is \theta.

A decision rule (or estimator) is a function \delta: \mathcal{X} \to \mathcal{A} mapping observed data to actions.

Definition

Risk Function

The risk function of a decision rule \delta is the expected loss under the true parameter:

R(\theta, \delta) = E_\theta[L(\theta, \delta(X))] = \int L(\theta, \delta(x)) \, p(x|\theta) \, dx

The risk function measures how well \delta performs, as a function of the unknown \theta. No single number summarizes performance because \theta is unknown.
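As a concrete illustration (not from the text), the risks of two rules for estimating a normal mean have closed forms, and neither rule dominates the other; this is exactly why the risk must be studied as a function of \theta rather than summarized by one number. The shrinkage factor 0.5 below is an arbitrary choice for the sketch.

```python
import numpy as np

# Estimating theta from n iid N(theta, 1) observations under squared error loss.
# Rule 1: the sample mean, delta_1(x) = xbar.     Risk: R(theta, delta_1) = 1/n.
# Rule 2: a shrinkage rule, delta_2(x) = 0.5*xbar. Risk (bias-variance):
#         R(theta, delta_2) = 0.25/n + 0.25 * theta**2.
n = 10
thetas = np.linspace(-2, 2, 101)
risk_mean = np.full_like(thetas, 1 / n)          # constant in theta
risk_shrunk = 0.25 / n + 0.25 * thetas**2        # small near 0, grows with |theta|

# Neither rule dominates: shrinkage wins near theta = 0, loses far away.
print(risk_shrunk[50] < risk_mean[50])   # True (at theta = 0)
print(risk_shrunk[0] > risk_mean[0])     # True (at theta = -2)
```

Comparing the two risk curves pointwise is the frequentist picture; integrating them against a prior collapses each curve to a single Bayes-risk number, which is the Bayesian picture introduced next.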

Definition

Bayes Risk

Given a prior distribution \pi on \Theta, the Bayes risk of a decision rule \delta is:

r(\pi, \delta) = \int R(\theta, \delta) \, d\pi(\theta) = E_\pi[R(\theta, \delta)]

The Bayes rule (or Bayes estimator) with respect to \pi is the decision rule \delta^* that minimizes the Bayes risk. Equivalently, \delta^* minimizes the posterior expected loss for each observed x:

\delta^*(x) = \arg\min_a \int L(\theta, a) \, p(\theta|x) \, d\theta

Standard Loss Functions and Their Bayes Rules

The choice of loss function determines the optimal action:

  • Squared error loss L(\theta, a) = (\theta - a)^2: Bayes rule is the posterior mean E[\theta|x].
  • Absolute error loss L(\theta, a) = |\theta - a|: Bayes rule is the posterior median.
  • 0-1 loss L(\theta, a) = \mathbf{1}(\theta \neq a): Bayes rule is the posterior mode (MAP estimate).

These are not interchangeable. For a skewed posterior, the mean, median, and mode are all different, so the "best" estimator depends on how you define "best." This is not a technical nuisance; it is the central point of decision theory.
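A quick numerical check makes the divergence concrete. The posterior below is a hypothetical Gamma(2, 1), chosen only because it is right-skewed; its mean, median, and mode are all different, so the three loss functions prescribe three different actions.

```python
import numpy as np

# Hypothetical skewed posterior: theta | x ~ Gamma(shape=2, scale=1),
# with mean 2, mode 1, and median about 1.68 (no closed form).
rng = np.random.default_rng(0)
draws = rng.gamma(shape=2.0, scale=1.0, size=200_000)  # posterior draws

post_mean = draws.mean()        # Bayes rule under squared error loss
post_median = np.median(draws)  # Bayes rule under absolute error loss
post_mode = 1.0                 # analytic mode = shape - 1 (MAP, 0-1 loss)

# Right skew strictly orders the three estimators: mode < median < mean.
print(post_mode < post_median < post_mean)  # True
```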

Savage's Axioms

Savage (1954) showed that if a decision-maker's preferences satisfy certain rationality axioms, then the decision-maker behaves as if maximizing expected utility with respect to a subjective probability distribution. The axioms do not assume probability exists. They derive it from preferences.

Definition

Savage Framework

The primitives are:

  • S: a set of states of the world (what might be true)
  • C: a set of consequences (outcomes the agent cares about)
  • F: the set of acts (functions f: S \to C)
  • \succsim: a preference relation on acts (f \succsim g means "act f is at least as good as act g")

Savage's seven axioms (P1 through P7) constrain \succsim to be "rational."

The key axioms are:

  1. P1 (Completeness and transitivity): \succsim is a total preorder on F.
  2. P2 (Sure-thing principle): If f and g agree on event A^c, then the preference between f and g depends only on their values on A. This is the independence axiom for decisions under uncertainty.
  3. P3 (State-independent preferences): Preferences over consequences do not depend on which state obtains.
  4. P4 (Comparative probability): Events can be consistently ranked by "more likely than."

Theorem

Savage Representation Theorem

Statement

If the preference relation \succsim on acts satisfies Savage's axioms P1 through P7, then there exist a unique probability measure P on states and a utility function u: C \to \mathbb{R} (unique up to positive affine transformation) such that for all acts f, g:

f \succsim g \iff \int_S u(f(s)) \, dP(s) \geq \int_S u(g(s)) \, dP(s)

The agent acts as if maximizing subjective expected utility.

Intuition

Rationality axioms force you to behave as if you have beliefs (a probability measure) and desires (a utility function), and you combine them by computing expected utility. The probability is not assumed; it is derived from your pattern of choices. If you violate the axioms, there exist scenarios where you can be exploited by a series of bets (a Dutch book).

Proof Sketch

The proof proceeds in stages. First, use P1 through P4 to construct a qualitative probability from the preference relation: event A is "more likely" than B if the agent prefers betting on A to betting on B for any favorable consequence. Then use P6 and P7 (fine-grained partitions) to extend this to a finitely additive probability measure. Finally, use the probability measure and the preference relation to construct the utility function via the Debreu representation theorem on mixture spaces.

Why It Matters

This theorem provides the foundation for Bayesian statistics and decision theory. It says: if you want to be rational (in the sense of the axioms), you must have a prior, and you must make decisions by expected utility maximization. Every violation of Bayesian decision theory corresponds to a violation of one of the axioms.

Failure Mode

The sure-thing principle (P2) is violated empirically by the Allais paradox and Ellsberg paradox. Prospect theory replaces the Savage framework with a descriptive model that accommodates these violations. The axioms also require a rich state space (no atoms too large), which fails in some simple decision problems.

Frequentist Decision Theory

Frequentist decision theory does not place a prior on \Theta. Instead, it evaluates decision rules by their risk function R(\theta, \delta) across all possible \theta.

Definition

Admissibility

A decision rule \delta is admissible if no other rule \delta' dominates it:

\nexists \; \delta' \text{ such that } R(\theta, \delta') \leq R(\theta, \delta) \; \forall \theta, \text{ with strict inequality for some } \theta

An inadmissible rule is dominated: some other rule does at least as well for every \theta and strictly better for some \theta, so the inadmissible rule should never be used.

Definition

Minimax Decision Rule

A decision rule \delta^* is minimax if it minimizes the worst-case risk:

\delta^* = \arg\min_\delta \sup_\theta R(\theta, \delta)

Minimax rules are conservative. They protect against the worst case.
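For a finite problem the sup and argmin are just array operations. A toy risk matrix (illustrative numbers only, not from any specific model) makes the conservatism visible: the minimax rule is the mediocre-everywhere one, not either specialist.

```python
import numpy as np

# Minimax selection for a finite problem: rows = decision rules,
# columns = states theta. Entries are risks R(theta, delta).
risk = np.array([
    [0.10, 0.90],   # rule 0: great in state 0, terrible in state 1
    [0.40, 0.45],   # rule 1: mediocre in both states
    [0.80, 0.20],   # rule 2: mirror image of rule 0
])

worst_case = risk.max(axis=1)           # sup over theta, for each rule
minimax_rule = int(worst_case.argmin()) # minimize the worst case

print(minimax_rule)  # 1: smallest worst-case risk (0.45)
```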

Theorem

Wald Complete Class Theorem (Sketch)

Statement

Under regularity conditions, every admissible decision rule is a Bayes rule (or a limit of Bayes rules). Equivalently, the class of Bayes rules forms an essentially complete class: for every non-Bayes rule, there exists a Bayes rule that is at least as good everywhere.

Intuition

If your estimator is not Bayes for any prior, then someone can find a Bayes estimator that dominates it. This is a deep connection: even if you refuse to be Bayesian, the admissible procedures you would use are exactly the Bayesian ones (or their limits with improper priors).

Proof Sketch

The proof uses a supporting hyperplane argument. The set of achievable risk vectors \{R(\cdot, \delta) : \delta \in \mathcal{D}\} is convex (by randomization). An admissible rule corresponds to a point on the lower boundary of this set. By the supporting hyperplane theorem, every such point minimizes some linear functional over the set, and that functional corresponds to a prior distribution weighting the risk across \Theta.

Why It Matters

The complete class theorem shows that Bayesian and frequentist approaches are not as different as textbooks suggest. Admissible frequentist procedures are Bayes procedures. The difference is in how you select which Bayes procedure to use: the Bayesian picks a prior by belief; the frequentist picks it by minimax or other criteria.

Failure Mode

The theorem requires compactness of \Theta. For non-compact parameter spaces, there may be admissible rules that are not Bayes or limits of Bayes rules. The James-Stein estimator shows that the MLE for a multivariate normal mean in dimension p \geq 3 is inadmissible, but the dominating estimator is a non-obvious shrinkage estimator.
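The James-Stein domination is easy to verify by simulation. The sketch below fixes one arbitrary \theta and p = 10 (any \theta and p \geq 3 behave the same way) and compares Monte Carlo estimates of the two risks under squared error loss.

```python
import numpy as np

# Monte Carlo check of the James-Stein phenomenon for p = 10.
# X ~ N(theta, I_p); the MLE is X itself; JS shrinks X toward the origin.
rng = np.random.default_rng(1)
p, reps = 10, 20_000
theta = np.ones(p)  # one arbitrary nonzero truth

X = theta + rng.standard_normal((reps, p))
norms2 = (X**2).sum(axis=1)
js = (1 - (p - 2) / norms2)[:, None] * X   # James-Stein estimator

risk_mle = ((X - theta)**2).sum(axis=1).mean()  # exact value is p = 10
risk_js = ((js - theta)**2).sum(axis=1).mean()  # strictly smaller

print(risk_js < risk_mle)  # True: the MLE is inadmissible for p >= 3
```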

Connections to ML

Loss function choice in ML is a decision-theoretic commitment:

  • Cross-entropy loss corresponds to a decision problem where the action is a predicted probability distribution and the loss penalizes KL divergence from the truth. The Bayes-optimal classifier under 0-1 loss is the MAP rule.
  • Squared error (MSE) in regression treats all errors symmetrically. If asymmetric costs matter (e.g., overestimating drug dosage is worse than underestimating), the loss function should reflect this.
  • Regularization in ML corresponds to a prior in the decision-theoretic framework. L2 regularization is a Gaussian prior; L1 is a Laplace prior. The complete class theorem explains why regularized estimators often dominate unregularized ones.
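The regularization-as-prior correspondence in the last bullet can be checked directly: solving the L2-penalized least-squares problem (here via an augmented design matrix) reproduces the closed-form posterior mode under a Gaussian prior. The data and \lambda below are arbitrary, and unit noise variance is assumed.

```python
import numpy as np

# Ridge regression as MAP estimation: minimizing ||y - X w||^2 + lam ||w||^2
# is the posterior mode under w ~ N(0, (1/lam) I) with N(0, 1) noise.
rng = np.random.default_rng(2)
n, d, lam = 50, 3, 2.0
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Penalized least squares, solved by stacking sqrt(lam)*I under X
# (the "regularizer" view: an ordinary least-squares problem).
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
w_ridge, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Gaussian-prior MAP from the posterior normal equations (the "prior" view).
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_ridge, w_map))  # True: the two views give one estimator
```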

Decision Theory vs. Game Theory

Decision theory is a single-agent framework: one decision-maker facing an unknown state of nature. Nature does not play strategically. Game theory extends this to multiple strategic agents, each with their own objectives.

Minimax decision theory is the bridge. In the minimax formulation, nature is treated as an adversary choosing \theta to maximize the agent's loss. This is a zero-sum game between the decision-maker and nature.

Common Confusions

Watch Out

Loss function is not the same as the objective function

The loss function L(\theta, a) measures the cost of action a when the true state is \theta. The risk R(\theta, \delta) averages the loss over the data distribution. The Bayes risk r(\pi, \delta) averages the risk over the prior. These are three different quantities. Minimizing loss for a given \theta is trivial (just set a = \theta). The problem is that \theta is unknown.

Watch Out

Admissibility is necessary but not sufficient

Being admissible means no other rule uniformly dominates you. But there can be many admissible rules, and some may be terrible in practice. Admissibility is a minimal requirement, not a recommendation. You still need additional criteria (minimax, Bayes optimality, simplicity) to select among admissible rules.

Watch Out

Minimax is not always conservative in a bad way

People dismiss minimax as overly conservative. But minimax rules are often Bayes rules with respect to a least-favorable prior. In some problems, the minimax rule coincides with the Bayes rule for a natural prior, so the pessimism costs nothing. For a normal mean under squared error loss, the sample mean is minimax and arises as a limit of Bayes estimators.

Exercises

ExerciseCore

Problem

A weather forecaster must predict whether it will rain (action a = 1) or not (a = 0). The true state is \theta \in \{0, 1\}. The loss function is L(\theta, a) = c \cdot \mathbf{1}(\theta = 1, a = 0) + \mathbf{1}(\theta = 0, a = 1), where c > 1 is the cost ratio of a missed rain warning to a false alarm. Given a posterior probability p = P(\theta = 1 | x), what is the Bayes-optimal decision rule?

ExerciseCore

Problem

Show that under squared error loss L(\theta, a) = (\theta - a)^2, the Bayes estimator with respect to prior \pi(\theta) is the posterior mean E[\theta | x].

ExerciseAdvanced

Problem

The James-Stein phenomenon: Let X \sim \mathcal{N}(\theta, I_p) with \theta \in \mathbb{R}^p and p \geq 3. The MLE is \hat{\theta} = X. Show that the James-Stein estimator \hat{\theta}_{\text{JS}} = (1 - (p-2)/\|X\|^2) X has strictly smaller risk than the MLE under squared error loss for all \theta. What does this imply about the admissibility of the MLE?

ExerciseResearch

Problem

Consider the connection between decision theory and ML. In neural network training with cross-entropy loss, the network learns an approximation to the Bayes-optimal classifier. But the Bayes-optimal classifier assumes the true data distribution is known. In practice, we only have a finite training set. Describe the gap between the decision-theoretic Bayes risk and the empirical risk minimizer in terms of approximation error, estimation error, and optimization error. Which of these does decision theory address, and which does it ignore?

References

Canonical:

  • Berger, Statistical Decision Theory and Bayesian Analysis (1985), Chapters 1-2, 4-5
  • Savage, The Foundations of Statistics (1954), Chapters 2-5

Textbook treatments:

  • Lehmann and Casella, Theory of Point Estimation (2nd ed., 1998), Chapters 1, 4-5
  • Robert, The Bayesian Choice (2nd ed., 2007), Chapters 2-4
  • Ferguson, Mathematical Statistics: A Decision Theoretic Approach (1967), Chapters 1-2

Modern perspective:

  • Gelman et al., Bayesian Data Analysis (3rd ed., 2013), Chapter 2

Next Topics

The natural next steps from decision theory foundations:

  • Expected utility: the von Neumann-Morgenstern axioms and utility representation for lotteries
  • Game theory: extending decision theory to strategic interaction with multiple agents

Last reviewed: April 2026
