Regression Methods
AIC and BIC
Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation.
Why This Matters
You have fit several models to the same data: a linear model, a quadratic model, a model with 20 features. Which one should you use? You cannot simply pick the one with the highest likelihood, because more complex models always fit the training data at least as well. This is the core problem of overfitting.
AIC and BIC are principled ways to balance goodness-of-fit against model complexity. They are fast to compute (no retraining needed), widely used, and have deep theoretical justifications.
Mental Model
Both AIC and BIC start with the log-likelihood (how well the model fits) and add a penalty for complexity (number of parameters). Both criteria require models estimated by maximum likelihood. The key difference: AIC penalizes complexity less aggressively, favoring more complex models. BIC penalizes complexity more, favoring simpler models.
This is not arbitrary. AIC and BIC are optimizing for different goals.
Definitions
Akaike Information Criterion
For a model with $k$ parameters and maximized likelihood $\hat{L}$:

$$\mathrm{AIC} = -2\ln\hat{L} + 2k$$
Lower AIC is better. The first term rewards fit; the second penalizes complexity.
Bayesian Information Criterion
For a model with $k$ parameters, $n$ observations, and maximized likelihood $\hat{L}$:

$$\mathrm{BIC} = -2\ln\hat{L} + k\ln n$$
Lower BIC is better. The penalty grows with sample size, making BIC increasingly hostile to complex models as $n$ increases.
Note: when $n \geq 8$, we have $\ln n > 2$, so BIC penalizes each parameter more than AIC. In practice, $n$ is almost always much larger than 8.
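As a minimal sketch (plain Python; it assumes you already have the maximized log-likelihood from your fitting routine), both criteria are one-liners:

```python
import math

def aic(log_lik: float, k: int) -> float:
    """AIC = -2 ln L-hat + 2k (lower is better)."""
    return -2 * log_lik + 2 * k

def bic(log_lik: float, k: int, n: int) -> float:
    """BIC = -2 ln L-hat + k ln n (lower is better)."""
    return -2 * log_lik + k * math.log(n)

# Per extra parameter, AIC adds 2 while BIC adds ln n,
# so BIC is the harsher penalty whenever n >= 8 (ln 8 > 2).
print(aic(-60.0, 2))      # 124.0
print(bic(-60.0, 2, 50))
```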
Asymptotic Properties
AIC and BIC have structurally different theoretical goals:
Asymptotic Efficiency (AIC)
AIC is asymptotically efficient: among all models under consideration, the model selected by AIC minimizes the expected Kullback-Leibler divergence to the true data-generating process. In other words, AIC selects the model that predicts best.
Consistency (BIC)
BIC is consistent: if the true model is among the candidates, BIC selects the true model with probability approaching 1 as $n \to \infty$. AIC is not consistent: it tends to overfit by selecting models that are too complex, even asymptotically.
The tradeoff is clear:
- AIC: best for prediction. May select an overly complex model, but that model will predict well.
- BIC: best for identifying the true model. May select an overly simple model if $n$ is small, but will eventually find the truth.
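The tradeoff is easy to see in a small simulation. The sketch below (numpy only; the quadratic data-generating process and the seed are arbitrary choices for illustration) fits polynomials of degree 1 through 5 and reports which degree each criterion selects as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_poly_loglik(x, y, degree):
    """OLS polynomial fit; Gaussian max log-likelihood with sigma^2 = RSS/n."""
    n = len(y)
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    sigma2 = np.mean(resid**2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def select(n, criterion):
    """Return the polynomial degree (1..5) chosen by 'aic' or 'bic'."""
    x = rng.uniform(-2, 2, n)
    y = 1 + 2 * x - 1.5 * x**2 + rng.normal(0, 1, n)  # true mean is quadratic
    scores = []
    for d in range(1, 6):
        ll = fit_poly_loglik(x, y, d)
        k = d + 2                     # coefficients plus the noise variance
        penalty = 2 * k if criterion == "aic" else k * np.log(n)
        scores.append(penalty - 2 * ll)
    return 1 + int(np.argmin(scores))

for n in (50, 500, 5000):
    print(n, select(n, "aic"), select(n, "bic"))
```

At large $n$, BIC should settle on degree 2 (the true model), while AIC occasionally picks a slightly larger degree.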
When to Use Each
Use AIC when:
- Your goal is prediction accuracy
- You believe the true model is not in your candidate set (it almost never is)
- You prefer to err on the side of including relevant variables
Use BIC when:
- Your goal is identifying the correct model (inference, scientific discovery)
- You believe the true model might be in your candidate set
- You prefer to err on the side of parsimony
In most ML applications, prediction is the goal, so AIC (or cross-validation) is more appropriate. In scientific modeling and causal inference, BIC is often preferred.
Connection to Cross-Validation
AIC Approximates Leave-One-Out CV
Statement
Under regularity conditions, AIC is asymptotically equivalent to leave-one-out cross-validation (LOO-CV). Specifically, for models estimated by maximum likelihood, the AIC score and the LOO-CV estimate of prediction error select the same model as $n \to \infty$.
Intuition
The $2k$ penalty in AIC can be derived as an estimate of the optimism: the amount by which training error underestimates test error. LOO-CV also estimates test error. Both are estimating the same quantity, so they agree asymptotically.
Proof Sketch
The key result is due to Stone (1977). The proof shows that the AIC correction equals the expected difference between training log-likelihood and test log-likelihood, up to terms that vanish as $n \to \infty$. LOO-CV directly estimates test log-likelihood. Both therefore estimate the same quantity with the same leading-order behavior.
Why It Matters
This connection means AIC gives you an approximate LOO-CV estimate for free, no retraining required. For models where cross-validation is expensive (e.g., large datasets or slow training), AIC is a practical alternative.
Failure Mode
The equivalence breaks down when: (1) the model is severely misspecified, (2) $k$ is large relative to $n$ (use the corrected AICc instead), or (3) the model is not estimated by maximum likelihood.
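The equivalence can also be checked by brute force. This sketch (numpy only; the simulated data are hypothetical) scores polynomial models two ways: by AIC, and by actually refitting $n$ times for leave-one-out error. Both should rank the badly underfit degree-1 model last and roughly agree near the best degree:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(0, 1, n)  # quadratic truth

def gauss_loglik(resid):
    """Gaussian max log-likelihood with sigma^2 set to its MLE, RSS/m."""
    m = len(resid)
    s2 = np.mean(resid**2)
    return -0.5 * m * (np.log(2 * np.pi * s2) + 1)

aics, loos = [], []
for d in (1, 2, 3, 4):
    coefs = np.polyfit(x, y, d)
    aics.append(2 * (d + 2) - 2 * gauss_loglik(y - np.polyval(coefs, x)))
    # leave-one-out: refit without point i, score the held-out squared error
    sq_err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        c = np.polyfit(x[mask], y[mask], d)
        sq_err += (y[i] - np.polyval(c, x[i])) ** 2
    loos.append(sq_err / n)

for d, a, l in zip((1, 2, 3, 4), aics, loos):
    print(d, round(a, 1), round(l, 3))
```

Note the cost asymmetry: the AIC column comes from one fit per model, the LOO column from $n$ refits per model.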
Corrected AIC (AICc)
When the number of parameters $k$ is not negligible compared to $n$, standard AIC underpenalizes. The corrected version is:

$$\mathrm{AIC_c} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}$$

Use AICc when $n/k < 40$ as a rough rule. AICc converges to AIC as $n \to \infty$.
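A sketch of the correction (the log-likelihood, $k$, and $n$ in the example call are made-up numbers for illustration):

```python
def aicc(log_lik: float, k: int, n: int) -> float:
    """AICc = AIC + 2k(k+1)/(n - k - 1); requires n > k + 1."""
    aic = -2 * log_lik + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

# Here n/k = 4, far below the rough n/k < 40 threshold, so the
# correction matters: AIC would be 70.0, AICc is about 74.3.
print(aicc(-30.0, 5, 20))
```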
Worked Example
Polynomial regression model selection
You fit polynomials of degree 1 through 5 to $n = 50$ data points ($\ln 50 \approx 3.91$). A degree-$d$ model has $k = d + 1$ parameters (including the intercept). Suppose the fitted values of $-2\ln\hat{L}$ are:
| Degree | $k$ | $-2\ln\hat{L}$ | AIC | BIC |
|---|---|---|---|---|
| 1 | 2 | 120.0 | 124.0 | 120.0 + 2(3.91) = 127.8 |
| 2 | 3 | 105.0 | 111.0 | 116.7 |
| 3 | 4 | 103.5 | 111.5 | 119.1 |
| 4 | 5 | 103.0 | 113.0 | 122.6 |
| 5 | 6 | 102.8 | 114.8 | 126.3 |
AIC selects degree 2 (AIC = 111.0). BIC also selects degree 2 (BIC = 116.7). Both agree that the degree-2 model best trades off fit and complexity. Note how higher-degree models barely improve the log-likelihood but accumulate penalty.
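The table can be reproduced directly from the $-2\ln\hat{L}$ column:

```python
import math

n = 50
neg2ll = {1: 120.0, 2: 105.0, 3: 103.5, 4: 103.0, 5: 102.8}  # degree -> -2 ln L-hat

for d, v in neg2ll.items():
    k = d + 1                       # degree-d polynomial plus intercept
    aic = v + 2 * k
    bic = v + k * math.log(n)
    print(f"degree {d}: AIC {aic:.1f}, BIC {bic:.1f}")
```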
Common Confusions
AIC and BIC are not comparable across different datasets
AIC and BIC values depend on the data. You can compare AIC values across models fit to the same data, but comparing AIC values from models fit to different datasets is meaningless. The log-likelihood scale depends on $n$ and the data distribution.
Lower is better, not higher
Both AIC and BIC are formulated so that lower is better. This is because both include $-2\ln\hat{L}$ (twice the negative log-likelihood), which decreases as fit improves. Some implementations negate this convention; always check.
BIC is not full Bayesian model comparison
Despite the name, BIC is only an approximation to the log marginal likelihood, and it rests on strong simplifying assumptions (large $n$, a flat prior, a regular model). True Bayesian model comparison uses the actual marginal likelihood, which can give very different answers.
Summary
- AIC $= -2\ln\hat{L} + 2k$: penalizes lightly, good for prediction
- BIC $= -2\ln\hat{L} + k\ln n$: penalizes more, good for model identification
- AIC is asymptotically efficient (best prediction); BIC is consistent (finds true model)
- AIC approximates leave-one-out cross-validation
- Use AICc when $n/k < 40$
- Lower is better for both criteria
- They can disagree, and that disagreement is informative
Exercises
Problem
For $n$ observations, how much does BIC penalize each additional parameter compared to AIC? At what sample size does BIC penalize exactly twice as much as AIC per parameter?
Problem
You have two models fit to the same $n$ observations: Model A with $k_A$ parameters and maximized log-likelihood $\ln\hat{L}_A$, and Model B with $k_B > k_A$ parameters and $\ln\hat{L}_B > \ln\hat{L}_A$. Derive the condition under which AIC prefers Model B while BIC prefers Model A. Interpret the disagreement.
References
Canonical:
- Akaike, "A new look at the statistical model identification" (1974)
- Schwarz, "Estimating the dimension of a model" (1978)
- Stone, "An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion" (1977)
Current:
- Burnham & Anderson, Model Selection and Multimodel Inference (2002). The definitive reference
- Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Section 7.5-7.7
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)