
AI Safety

Ethics and Fairness in ML

Fairness definitions (demographic parity, equalized odds, calibration), the impossibility theorem showing they cannot all hold simultaneously, bias sources, and mitigation strategies at each stage of the pipeline.

Advanced · Tier 2 · ~55 min

Why This Matters

ML systems make consequential decisions: who gets a loan, who gets bail, who gets hired, what medical treatment is recommended. These decisions often rely on classifiers whose outputs must be carefully evaluated using model evaluation techniques. If these systems discriminate on the basis of protected attributes (race, gender, age), they cause real harm at scale.

Fairness in ML is both a technical and a social problem. The technical challenge is precise: different mathematical definitions of "fairness" are provably incompatible. You cannot satisfy all of them simultaneously except in trivial cases. Choosing which definition to satisfy is a value judgment, not a mathematical one.

Fairness Definitions

Let \hat{Y} be the model prediction, Y the true outcome, and A a protected attribute (e.g., A \in \{0, 1\} for two groups).

Definition

Demographic Parity

A classifier satisfies demographic parity if the prediction is independent of the protected attribute:

P(\hat{Y} = 1 \mid A = 0) = P(\hat{Y} = 1 \mid A = 1)

Both groups receive positive predictions at equal rates.
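This definition can be checked directly from predictions. A minimal sketch with made-up toy data (the arrays and helper name are ours, not from any particular library):

```python
# Hypothetical toy data: binary predictions and a binary protected attribute.
yhat = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
a    = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

def positive_rate(yhat, a, group):
    """P(yhat = 1 | A = group): the group's positive-prediction rate."""
    preds = [p for p, g in zip(yhat, a) if g == group]
    return sum(preds) / len(preds)

# Demographic parity holds iff this gap is zero.
gap = abs(positive_rate(yhat, a, 0) - positive_rate(yhat, a, 1))
print(gap)
```

Here group 0 receives positive predictions at rate 0.6 and group 1 at rate 0.4, so this toy classifier violates demographic parity.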

Definition

Equalized Odds

A classifier satisfies equalized odds if the prediction is independent of the protected attribute conditional on the true outcome:

P(\hat{Y} = 1 \mid Y = y, A = 0) = P(\hat{Y} = 1 \mid Y = y, A = 1) \quad \text{for } y \in \{0, 1\}

Both groups have equal true positive rates and equal false positive rates.
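Equalized odds therefore reduces to comparing per-group (TPR, FPR) pairs. A minimal sketch with hypothetical data:

```python
def group_rates(y, yhat, a, group):
    """(TPR, FPR) within one group: P(yhat=1 | y=1, A=g), P(yhat=1 | y=0, A=g)."""
    pos = [yh for yt, yh, g in zip(y, yhat, a) if g == group and yt == 1]
    neg = [yh for yt, yh, g in zip(y, yhat, a) if g == group and yt == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

# Hypothetical toy data.
y    = [1, 1, 0, 0, 1, 1, 0, 0]
yhat = [1, 0, 1, 0, 1, 0, 0, 0]
a    = [0, 0, 0, 0, 1, 1, 1, 1]
print(group_rates(y, yhat, a, 0))  # (TPR, FPR) for group 0
print(group_rates(y, yhat, a, 1))  # (TPR, FPR) for group 1
```

In this toy example both groups have TPR 0.5, but the FPRs are 0.5 versus 0.0, so equalized odds is violated even though true positive rates match.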

Definition

Calibration

A classifier satisfies calibration (or predictive parity) if the predicted probability means the same thing for both groups:

P(Y = 1 \mid \hat{Y} = s, A = 0) = P(Y = 1 \mid \hat{Y} = s, A = 1) \quad \text{for all } s

When the model says "70% chance," it should mean 70% for both groups.
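A common empirical check is to bin the scores and compare the observed outcome rate per bin across groups. A minimal sketch (function name and data are ours):

```python
from collections import defaultdict

def calibration_by_group(scores, y, a, n_bins=2):
    """Empirical P(Y = 1) within each (score bin, group) cell."""
    cells = defaultdict(list)
    for s, yi, g in zip(scores, y, a):
        b = min(int(s * n_bins), n_bins - 1)  # map score in [0, 1] to a bin index
        cells[(b, g)].append(yi)
    return {k: sum(v) / len(v) for k, v in sorted(cells.items())}

# Hypothetical data: a high score means Y=1 in group 0 but not in group 1.
scores = [0.2, 0.8, 0.2, 0.8]
y      = [0,   1,   0,   0]
a      = [0,   0,   1,   1]
print(calibration_by_group(scores, y, a))
```

If the high-score bin shows an outcome rate of 1.0 for one group and 0.0 for the other, as here, the same score means different things for the two groups: calibration fails.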

The Impossibility Theorem

Theorem

Impossibility of Simultaneous Fairness

Statement

If the base rates differ between groups (P(Y = 1 \mid A = 0) \neq P(Y = 1 \mid A = 1)) and the classifier is imperfect (makes errors), then it is impossible to simultaneously satisfy demographic parity, equalized odds, and calibration.

More precisely, Chouldechova (2017) showed: for an imperfect binary classifier, if P(Y = 1 \mid A = 0) \neq P(Y = 1 \mid A = 1), then the classifier cannot simultaneously have equal false positive rates, equal false negative rates, and equal positive predictive values across both groups.

Intuition

If group A has a 30% base rate and group B has a 10% base rate, a calibrated classifier must predict higher scores for group A on average (to be accurate). But demographic parity requires equal prediction rates. These two requirements directly conflict when base rates differ.

Proof Sketch

Write out the relationship between PPV (positive predictive value), FPR (false positive rate), FNR (false negative rate), prevalence π\pi, and positive prediction rate rr:

\text{PPV} = \frac{(1 - \text{FNR}) \cdot \pi}{r}, \quad r = (1 - \text{FNR}) \cdot \pi + \text{FPR} \cdot (1 - \pi)

If \pi_0 \neq \pi_1 (different base rates) and we require \text{FPR}_0 = \text{FPR}_1 and \text{FNR}_0 = \text{FNR}_1 (equalized odds), then substituting the different \pi values gives \text{PPV}_0 \neq \text{PPV}_1 (calibration fails). Similar arguments rule out the other combinations.
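The substitution can be verified numerically. A sketch using the PPV identity above with illustrative (made-up) error rates and base rates:

```python
def ppv(fnr, fpr, prev):
    """PPV = (1 - FNR) * pi / r, with r = (1 - FNR) * pi + FPR * (1 - pi)."""
    r = (1 - fnr) * prev + fpr * (1 - prev)
    return (1 - fnr) * prev / r

# Equalized odds: identical FNR and FPR for both groups,
# but base rates differ (pi_0 = 0.3, pi_1 = 0.1).
ppv0 = ppv(fnr=0.2, fpr=0.1, prev=0.3)
ppv1 = ppv(fnr=0.2, fpr=0.1, prev=0.1)
print(ppv0, ppv1)  # unequal PPVs: predictive parity (calibration) fails
```

With equal error rates, the lower-prevalence group necessarily gets the lower PPV, which is exactly the conflict the theorem formalizes.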

Why It Matters

This theorem means there is no "fair classifier" in the absolute sense. Every deployed system must choose which fairness criterion to prioritize, and this choice has ethical and political implications that mathematics alone cannot resolve. Anyone claiming their system is "fair" without specifying which definition they mean is either confused or misleading.

Failure Mode

The impossibility vanishes in two degenerate cases: (1) the base rates are equal (P(Y = 1 \mid A = 0) = P(Y = 1 \mid A = 1)), in which case all definitions can be simultaneously satisfied, or (2) the classifier is perfect (zero errors), in which case all definitions are trivially satisfied. Neither case holds in practice.

Sources of Bias

Historical bias. The training data reflects past discriminatory decisions. If historical hiring data shows men were hired more often, a model trained on this data will perpetuate the pattern, even if the underlying qualifications are equal.

Measurement bias. The proxy variable does not measure the concept of interest equally across groups. Using "number of arrests" as a proxy for "criminality" introduces bias because arrest rates differ by race even after controlling for behavior.

Label bias. The labels themselves are biased. In medical diagnosis, certain conditions are systematically underdiagnosed in specific populations, making the training labels less accurate for those groups.

Aggregation bias. A single model is fit to a population where subgroups have different relationships between features and outcomes. A model predicting diabetes risk that does not account for different risk profiles by ethnicity will be less accurate for minority groups.

Mitigation Strategies

Preprocessing: Fix the Data

Rebalancing. Resample or reweight training data so that both groups have equal representation and equal base rates. This directly targets demographic parity but may reduce overall accuracy.
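One standard reweighting scheme (in the spirit of Kamiran and Calders' reweighing; the helper name and data are ours) gives each example the weight that would make A and Y statistically independent in the weighted data:

```python
from collections import Counter

def reweighing_weights(y, a):
    """w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y).
    Under these weights, A and Y are independent in the training data."""
    n = len(y)
    pa, py, pay = Counter(a), Counter(y), Counter(zip(a, y))
    return [(pa[ai] / n) * (py[yi] / n) / (pay[(ai, yi)] / n)
            for ai, yi in zip(a, y)]

# Hypothetical imbalanced data: group 0 gets y=1 more often than group 1.
y = [1, 1, 1, 0, 1, 0, 0, 0]
a = [0, 0, 0, 0, 1, 1, 1, 1]
print(reweighing_weights(y, a))
```

Over-represented (group, label) combinations get weights below 1 and under-represented ones get weights above 1, so a loss weighted this way no longer rewards reproducing the historical association between A and Y.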

Fair representation learning. Learn a data representation Z = g(X) that retains predictive information about Y but removes information about A. Formally, minimize I(Z; A) while maximizing I(Z; Y), where I is mutual information (from information theory). This is a constrained optimization problem.

In-Processing: Constrained Optimization

Add fairness constraints directly to the training objective:

\min_\theta \mathcal{L}(\theta) \quad \text{subject to} \quad |P(\hat{Y} = 1 \mid A = 0) - P(\hat{Y} = 1 \mid A = 1)| \leq \epsilon
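In practice the hard constraint is often relaxed into a penalty term (a Lagrangian-style relaxation). A minimal sketch of such a penalized objective, assuming soft predictions yhat_prob and a precomputed base loss (names are ours):

```python
def penalized_loss(base_loss, yhat_prob, a, lam=1.0):
    """base_loss + lam * |demographic-parity gap| (Lagrangian relaxation).
    yhat_prob: per-example predicted probabilities; a: group labels (0/1)."""
    g0 = [p for p, g in zip(yhat_prob, a) if g == 0]
    g1 = [p for p, g in zip(yhat_prob, a) if g == 1]
    gap = abs(sum(g0) / len(g0) - sum(g1) / len(g1))
    return base_loss + lam * gap

# Toy example: group 0 gets much higher scores, so the penalty is large.
total = penalized_loss(0.30, [0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1], lam=0.5)
print(total)
```

Because the penalty uses probabilities rather than hard predictions, it stays differentiable and can be minimized with gradient-based training; larger lam trades accuracy for a smaller parity gap.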

Proposition

Equalized Odds via Post-Processing

Statement

Given a score function s(x) and group membership A, the equalized-odds-optimal classifier can be obtained by choosing group-specific thresholds t_0, t_1 and randomization parameters. The optimal thresholds solve:

\min_{t_0, t_1} \; \mathbb{E}[\ell(\hat{Y}, Y)] \quad \text{s.t.} \quad \text{TPR}_0 = \text{TPR}_1, \; \text{FPR}_0 = \text{FPR}_1

This is a linear program over the ROC curves of each group.

Intuition

Each group has its own ROC curve (TPR vs FPR as the threshold varies). Equalized odds requires both groups to operate at the same (TPR, FPR) point. The optimal choice is the point on the intersection of feasible regions that minimizes overall error. Randomization may be needed to achieve exact equality.

Proof Sketch

The ROC curve for each group defines a set of achievable (FPR, TPR) pairs. By randomizing between two thresholds, any point on the line segment between two achievable points is also achievable. The constraint requires both groups to be at the same (FPR, TPR) point. This is a linear feasibility problem. Minimizing error subject to this constraint is a linear program.

Why It Matters

This shows that equalized odds can always be achieved by post-processing, without retraining the model. The cost is reduced accuracy (operating at a suboptimal threshold for at least one group). This quantifies the accuracy cost of fairness.

Failure Mode

Post-processing requires knowing group membership A at prediction time, which may not be available or may be legally prohibited. The accuracy cost can be substantial when group score distributions are very different. Randomized classifiers may be unacceptable in high-stakes decisions.

Post-Processing: Adjust Thresholds

After training a score function s(x), set different classification thresholds for each group to satisfy the desired fairness criterion. For demographic parity: choose t_0, t_1 such that P(s(x) > t_0 \mid A = 0) = P(s(x) > t_1 \mid A = 1).
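The threshold choice amounts to taking the same quantile of each group's score distribution. A minimal sketch, classifying s >= t as positive and assuming distinct scores (names and data are ours):

```python
def dp_thresholds(scores, a, target_rate):
    """Per-group thresholds so each group's positive rate is ~target_rate
    (classify s >= t as positive; assumes distinct scores per group)."""
    thresholds = {}
    for g in set(a):
        s = sorted((x for x, gi in zip(scores, a) if gi == g), reverse=True)
        k = round(target_rate * len(s))  # number of positives for this group
        thresholds[g] = s[k - 1] if k > 0 else float("inf")
    return thresholds

# Hypothetical scores; group 1's scores are lower overall.
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2]
a      = [0,   0,   0,   0,   1,   1,   1,   1]
print(dp_thresholds(scores, a, target_rate=0.5))
```

Each group ends up with a 50% positive rate, but at different thresholds: the lower-scoring group gets a lower bar, which is exactly the group-dependent treatment that demographic parity via post-processing entails.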

Common Confusions

Watch Out

Removing the protected attribute does not remove bias

Dropping the column A from the feature set does not make the model fair. Other features (zip code, name, occupation) can be correlated with A and serve as proxies. This is called "redundant encoding." A model can effectively reconstruct A from correlated features.
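A quick synthetic demonstration of redundant encoding: drop A from the features but keep a correlated proxy, and A remains recoverable (the data and the 90% correlation level are illustrative assumptions):

```python
import random

random.seed(0)
# A is dropped from the feature set, but a proxy (think zip code)
# remains and agrees with A 90% of the time.
a = [random.randint(0, 1) for _ in range(1000)]
proxy = [ai if random.random() < 0.9 else 1 - ai for ai in a]

# Even the trivial "model" that reads A off the proxy alone
# recovers the protected attribute far above the 50% chance level.
recovery_acc = sum(p == ai for p, ai in zip(proxy, a)) / len(a)
print(recovery_acc)
```

A real model trained on many such correlated features can reconstruct A even more reliably, which is why deleting the protected column is not a fairness guarantee.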

Watch Out

Fairness is not accuracy

A model can be highly accurate overall while being systematically wrong for a minority group. This is a manifestation of the bias-variance tradeoff at the subgroup level. If group A is 90% of the data, a model optimized for overall accuracy will tolerate poor performance on group B. Accuracy is a population-level metric that can hide group-level disparities.

Watch Out

Equal base rates do not mean equal treatment is fair

Even if P(Y = 1 \mid A = 0) = P(Y = 1 \mid A = 1), the model may have different error rates for each group because features have different predictive power across groups. Equalized odds is a stronger requirement than equal base rates.

Canonical Examples

Example

COMPAS recidivism prediction

The COMPAS system assigns risk scores for criminal recidivism. ProPublica (2016) showed it had higher false positive rates for Black defendants than white defendants (equalized odds violated). Northpointe (the vendor) responded that the scores were calibrated: a score of 7 meant roughly the same recidivism probability for both groups. The impossibility theorem explains why both observations can be true simultaneously. With different base rates, calibration and equal error rates cannot both hold.

Summary

  • Three main fairness definitions: demographic parity, equalized odds, calibration
  • The impossibility theorem: with different base rates, you cannot satisfy all three
  • Bias enters through data (historical, measurement, label, aggregation)
  • Mitigation at three stages: preprocessing (fix data), in-processing (constrained training), post-processing (adjust thresholds)
  • Removing the protected attribute does not ensure fairness due to proxy features
  • Choosing which fairness criterion to optimize is a value judgment, not a technical one

Exercises

ExerciseCore

Problem

Group A has base rate P(Y = 1 \mid A = 0) = 0.5 and Group B has base rate P(Y = 1 \mid A = 1) = 0.1. A classifier achieves demographic parity by predicting \hat{Y} = 1 for 30% of each group. Is this classifier calibrated? Why or why not?

ExerciseAdvanced

Problem

You have a trained risk score s(x) \in [0, 1] and need to produce a binary classifier satisfying equalized odds. Group 0 has an ROC curve with AUC 0.90 and Group 1 has an ROC curve with AUC 0.75. Explain why achieving equalized odds will be more costly (in terms of overall accuracy) than if both groups had AUC 0.90.

References

Canonical:

  • Chouldechova, "Fair Prediction with Disparate Impact" (2017), Sections 1-3
  • Hardt, Price, Srebro, "Equality of Opportunity in Supervised Learning" (2016)

Current:

  • Barocas, Hardt, Narayanan, Fairness and Machine Learning (2023), Chapters 1-4
  • Mehrabi et al., "A Survey on Bias and Fairness in Machine Learning" (2021)
  • Corbett-Davies & Goel, "The Measure and Mismeasure of Fairness" (2018)

Next Topics

  • Fairness intersects with all deployment decisions in ML systems

Last reviewed: April 2026