

AIC and BIC

Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation.

Core · Tier 1 · Stable · ~40 min

Why This Matters

You have fit several models to the same data: a linear model, a quadratic model, a model with 20 features. Which one should you use? You cannot simply pick the one with the highest likelihood, because more complex models always fit the training data at least as well. This is the core problem of overfitting.

AIC and BIC are principled ways to balance goodness-of-fit against model complexity. They are fast to compute (no retraining needed), widely used, and have deep theoretical justifications.

Mental Model

Both AIC and BIC start with the log-likelihood (how well the model fits) and add a penalty for complexity (number of parameters). Both criteria require models estimated by maximum likelihood. The key difference: AIC penalizes complexity less aggressively, favoring more complex models. BIC penalizes complexity more, favoring simpler models.

This is not arbitrary. AIC and BIC are optimizing for different goals.

Definitions

Definition

Akaike Information Criterion

For a model with $k$ parameters and maximized log-likelihood $\hat{L}$:

$$\text{AIC} = -2 \log \hat{L} + 2k$$

Lower AIC is better. The first term rewards fit; the second penalizes complexity.

Definition

Bayesian Information Criterion

For a model with $k$ parameters, $n$ observations, and maximized log-likelihood $\hat{L}$:

$$\text{BIC} = -2 \log \hat{L} + k \log n$$

Lower BIC is better. The penalty $k \log n$ grows with sample size, making BIC increasingly hostile to complex models as $n$ increases.

Note: when $n \geq 8$, we have $\log n > 2$, so BIC penalizes each parameter more than AIC. In practice, $n$ is almost always much larger than 8.
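Both formulas are one-liners in code. The sketch below (with illustrative values, not from any particular dataset) shows the per-parameter penalty difference directly: 2 for AIC versus $\log n$ for BIC.

```python
import numpy as np

def aic(log_lik, k):
    """Akaike information criterion: -2 log L + 2k (lower is better)."""
    return -2.0 * log_lik + 2 * k

def bic(log_lik, k, n):
    """Bayesian information criterion: -2 log L + k log n (lower is better)."""
    return -2.0 * log_lik + k * np.log(n)

# Per-parameter penalty: 2 for AIC, log(n) for BIC.
# BIC penalizes more once n >= 8, since log(8) ≈ 2.08 > 2.
print(aic(-52.5, 3))        # 111.0
print(bic(-52.5, 3, n=50))  # ≈ 116.7
```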

Asymptotic Properties

AIC and BIC have structurally different theoretical goals:

Definition

Asymptotic Efficiency (AIC)

AIC is asymptotically efficient: among all models under consideration, the model selected by AIC minimizes the expected Kullback-Leibler divergence to the true data-generating process. In other words, AIC selects the model that predicts best.

Definition

Consistency (BIC)

BIC is consistent: if the true model is among the candidates, BIC selects the true model with probability approaching 1 as $n \to \infty$. AIC is not consistent. It tends to overfit by selecting models that are too complex, even asymptotically.

The tradeoff is clear:

  • AIC: best for prediction. May select an overly complex model, but that model will predict well.
  • BIC: best for identifying the true model. May select an overly simple model if $n$ is small, but will eventually find the truth.

When to Use Each

Use AIC when:

  • Your goal is prediction accuracy
  • You believe the true model is not in your candidate set (it almost never is)
  • You prefer to err on the side of including relevant variables

Use BIC when:

  • Your goal is identifying the correct model (inference, scientific discovery)
  • You believe the true model might be in your candidate set
  • You prefer to err on the side of parsimony

In most ML applications, prediction is the goal, so AIC (or cross-validation) is more appropriate. In scientific modeling and causal inference, BIC is often preferred.

Connection to Cross-Validation

Proposition

AIC Approximates Leave-One-Out CV

Statement

Under regularity conditions, AIC is asymptotically equivalent to leave-one-out cross-validation (LOO-CV). Specifically, for models estimated by maximum likelihood, the AIC score and the LOO-CV estimate of prediction error select the same model as $n \to \infty$.

Intuition

The $2k$ penalty in AIC can be derived as an estimate of the optimism: the amount by which training error underestimates test error. LOO-CV also estimates test error. Both are estimating the same quantity, so they agree asymptotically.

Proof Sketch

The key result is due to Stone (1977). The proof shows that the AIC correction $2k$ equals the expected difference between training log-likelihood and test log-likelihood, up to $O(1/n)$ terms. LOO-CV directly estimates test log-likelihood. Both therefore estimate the same quantity with the same leading-order behavior.

Why It Matters

This connection means AIC gives you an approximate LOO-CV estimate for free, with no retraining required. For models where cross-validation is expensive (e.g., large datasets or slow training), AIC is a practical alternative.

Failure Mode

The equivalence breaks down when: (1) the model is severely misspecified, (2) $k$ is large relative to $n$ (use corrected AICc instead), or (3) the model is not estimated by maximum likelihood.
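The AIC–LOO-CV connection can be checked numerically. A minimal sketch, under assumptions not in the text: polynomial regression with Gaussian noise, the MLE of the noise variance counted as a parameter, and the exact hat-matrix shortcut for OLS leave-one-out residuals. On synthetic quadratic data, both scores should rank the candidate degrees similarly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(-2, 2, n)
# Synthetic data: the true model is quadratic with unit Gaussian noise.
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(0.0, 1.0, n)

def fit_metrics(degree):
    X = np.vander(x, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                  # MLE of the noise variance
    k = degree + 2                              # coefficients + sigma^2
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    aic = -2 * log_lik + 2 * k
    # Exact LOO residuals for OLS via the hat matrix: e_i / (1 - h_ii)
    H = X @ np.linalg.pinv(X)
    loo_mse = np.mean((resid / (1 - np.diag(H))) ** 2)
    return aic, loo_mse

for d in range(1, 6):
    a, l = fit_metrics(d)
    print(f"degree {d}: AIC {a:8.1f}   LOO MSE {l:.3f}")
```

Both columns should drop sharply from degree 1 to degree 2 (the true degree) and then flatten, illustrating that the two criteria estimate the same quantity.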

Corrected AIC (AICc)

When the number of parameters $k$ is not negligible compared to $n$, standard AIC underpenalizes. The corrected version is:

$$\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n - k - 1}$$

Use AICc when $n/k < 40$ as a rough rule. AICc converges to AIC as $n \to \infty$.
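The correction term is easy to compute and makes its small-sample behavior concrete; a minimal sketch with illustrative numbers (the guard clause reflects that the formula requires $n > k + 1$):

```python
def aicc(log_lik, k, n):
    """Small-sample corrected AIC; valid only when n > k + 1."""
    if n <= k + 1:
        raise ValueError("AICc requires n > k + 1")
    aic = -2.0 * log_lik + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

# The correction shrinks as n grows, so AICc -> AIC:
#   n = 30,   k = 5  ->  2*5*6/24   = 2.5  extra penalty
#   n = 3000, k = 5  ->  2*5*6/2994 ≈ 0.02 extra penalty
print(aicc(-50.0, 5, 30))  # 112.5 (AIC = 110.0 plus correction 2.5)
```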

Worked Example

Example

Polynomial regression model selection

You fit polynomials of degree 1 through 5 to $n = 50$ data points. Each model has $k = d + 1$ parameters (including the intercept). Suppose the fitted models give the following $-2\log\hat{L}$ values:

| Degree | $k$ | $-2\log\hat{L}$ | AIC | BIC |
|---|---|---|---|---|
| 1 | 2 | 120.0 | 124.0 | 127.8 |
| 2 | 3 | 105.0 | 111.0 | 116.7 |
| 3 | 4 | 103.5 | 111.5 | 119.1 |
| 4 | 5 | 103.0 | 113.0 | 122.6 |
| 5 | 6 | 102.8 | 114.8 | 126.3 |

(For example, the degree-1 BIC is $120.0 + 2 \log 50 = 120.0 + 2(3.91) = 127.8$.)

AIC selects degree 2 (AIC = 111.0). BIC also selects degree 2 (BIC = 116.7). Both agree that the degree-2 model best trades off fit and complexity. Note how higher-degree models barely improve the log-likelihood but accumulate penalty.
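The selection in this worked example can be reproduced in a few lines; a minimal sketch that recomputes both criteria from the tabulated $-2\log\hat{L}$ values:

```python
import numpy as np

# -2 log-likelihoods from the worked example, keyed by polynomial degree
neg2_log_lik = {1: 120.0, 2: 105.0, 3: 103.5, 4: 103.0, 5: 102.8}
n = 50

scores = {}
for degree, nll in neg2_log_lik.items():
    k = degree + 1  # coefficients including the intercept
    scores[degree] = {"AIC": nll + 2 * k, "BIC": nll + k * np.log(n)}

best_aic = min(scores, key=lambda d: scores[d]["AIC"])
best_bic = min(scores, key=lambda d: scores[d]["BIC"])
print(best_aic, best_bic)  # -> 2 2: both criteria select the quadratic
```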

Common Confusions

Watch Out

AIC and BIC are not comparable across different datasets

AIC and BIC values depend on the data. You can compare AIC values across models fit to the same data, but comparing AIC values from models fit to different datasets is meaningless. The log-likelihood scale depends on $n$ and the data distribution.

Watch Out

Lower is better, not higher

Both AIC and BIC are formulated so that lower is better. This is because they use $-2\log\hat{L}$ (negative log-likelihood), which decreases as fit improves. Some implementations negate this convention, so always check.

Watch Out

BIC is only an approximation to Bayesian model comparison

Despite the name, BIC is only an approximation to the log marginal likelihood, derived under strong simplifying assumptions (large $n$, a flat prior, a regular model). True Bayesian model comparison uses the actual marginal likelihood, which can give very different answers.

Summary

  • AIC $= -2\log\hat{L} + 2k$: penalizes lightly, good for prediction
  • BIC $= -2\log\hat{L} + k\log n$: penalizes more heavily, good for model identification
  • AIC is asymptotically efficient (best prediction); BIC is consistent (finds true model)
  • AIC approximates leave-one-out cross-validation
  • Use AICc when $n/k < 40$
  • Lower is better for both criteria
  • They can disagree, and that disagreement is informative

Exercises

ExerciseCore

Problem

For $n = 100$ observations, how much does BIC penalize each additional parameter compared to AIC? At what sample size does BIC penalize exactly twice as much as AIC per parameter?

ExerciseAdvanced

Problem

You have two models: Model A with $k = 3$ parameters and $\log\hat{L}_A = -50$, and Model B with $k = 10$ parameters and $\log\hat{L}_B = -45$. With $n = 200$ observations, which model does AIC prefer? Which does BIC prefer? Interpret the disagreement.

References

Canonical:

  • Akaike, "A new look at the statistical model identification" (1974)
  • Schwarz, "Estimating the dimension of a model" (1978)
  • Stone, "An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion" (1977)

Current:

  • Burnham & Anderson, Model Selection and Multimodel Inference (2002). The definitive reference
  • Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Sections 7.5–7.7

Last reviewed: April 2026
