Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

LLM Construction

Hallucination Theory

Why large language models confabulate, the mathematical frameworks for understanding when model outputs are unreliable, and what current theory says about mitigation.

Advanced · Tier 1 · Current · ~55 min

Why This Matters

Language models generate fluent, confident text that is sometimes factually wrong. They cite papers that do not exist. They invent historical events. They produce plausible-looking mathematical proofs with subtle errors. This is not a bug that will be fixed with more data or bigger models. It is a structural consequence of the training objective.

Understanding why hallucination occurs, and what mathematical tools exist to detect and mitigate it, is essential for deploying AI systems responsibly.

Mental Model

An LLM is trained to predict the next token given the previous tokens. This objective rewards producing text that looks like the training distribution, not text that is true. A model that perfectly minimizes cross-entropy loss on internet text will reproduce the statistical patterns of internet text, including its errors, inconsistencies, and fabrications.

Truthfulness is not a training signal. The model has no internal mechanism to verify facts against reality. It is a very good pattern matcher that sometimes matches patterns that do not correspond to truth.

The Root Cause: Objective Mismatch

Definition

Next-Token Prediction Objective

Given a sequence of tokens $x_1, \ldots, x_{t-1}$, the model minimizes:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t | x_1, \ldots, x_{t-1})$$

This is the cross-entropy loss against the empirical distribution of text.

The critical observation: this objective has no term for factual correctness. A model can achieve low loss by assigning high probability to fluent, grammatical, topically coherent text, even if the content is false. The training data contains both true and false statements, and the model learns to reproduce both kinds.
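The objective can be computed directly from per-token probabilities. A minimal sketch in pure Python with toy numbers (no real model assumed):

```python
import math

def next_token_loss(token_probs):
    """Cross-entropy loss: negative sum of the log-probabilities the
    model assigned to each token that actually occurred."""
    return -sum(math.log(p) for p in token_probs)

# Probabilities a hypothetical model assigned to the observed tokens.
# The loss rewards matching the text, regardless of whether it is true:
# a fluent false sentence with high token probabilities scores just as well.
loss = next_token_loss([0.9, 0.5, 0.8])
print(round(loss, 4))  # 1.0217
```

Nothing in this computation inspects the content of the tokens, which is exactly the objective-mismatch point.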

Types of Hallucination

Factual fabrication. The model states something false with high confidence. Example: generating a plausible-sounding biography with incorrect dates, or claiming a theorem holds when it does not.

Citation fabrication. The model produces references that do not exist. The titles and authors look plausible because they match the statistical patterns of real citations, but the specific combination is invented.

Logical inconsistency. The model produces a chain of reasoning where individual steps look correct but the overall argument is invalid. This is particularly dangerous in mathematical or legal contexts.

Entity confusion. The model conflates attributes of different entities because they appear in similar contexts in the training data.

Calibration Theory

If we cannot eliminate hallucination, can we at least know when the model is likely to be wrong?

Definition

Calibration

A model is calibrated if, among all predictions it makes with confidence $p$, a fraction $p$ of them are correct:

$$P(Y = y \mid p_\theta(Y = y | X) = p) = p \quad \text{for all } p \in [0,1]$$

Calibration means the model's uncertainty estimates are meaningful.

Definition

Expected Calibration Error

The expected calibration error measures the average gap between confidence and accuracy under the predictive distribution of the model's top-class confidence $C(X) = \max_y p_\theta(y \mid X)$:

$$\text{ECE} = \mathbb{E}_{X}\bigl[\,|\, P(Y = \hat{Y}(X) \mid C(X)) - C(X) \,|\,\bigr]$$

where $\hat{Y}(X)$ is the model's predicted label. In practice (Guo et al. 2017) this is estimated by the binned empirical version: partition the confidence interval into $M$ equal-width bins $B_1, \ldots, B_M$ and compute

$$\widehat{\text{ECE}} = \sum_{m=1}^{M} \frac{|B_m|}{n} \bigl|\,\text{acc}(B_m) - \text{conf}(B_m)\,\bigr|.$$

The binned estimator is what nearly every paper reports; keep the distinction between the population quantity and the estimator in mind.
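The binned estimator is short enough to write out in full. A sketch in pure Python, using equal-width bins over $[0,1]$ (edge-assignment conventions vary slightly between papers):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE (Guo et al. 2017): sum over bins of
    (bin size / n) * |accuracy(bin) - mean confidence(bin)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Equal-width bins; confidence 1.0 goes in the last bin.
        m = min(int(c * n_bins), n_bins - 1)
        bins[m].append((c, y))
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(acc - conf)
    return ece

# Ten predictions, all at confidence 0.85, but only 60% correct:
# the gap |0.60 - 0.85| = 0.25 is the entire (overconfident) ECE.
print(round(expected_calibration_error([0.85] * 10, [1] * 6 + [0] * 4), 4))  # 0.25
```

This is the estimator $\widehat{\text{ECE}}$, not the population ECE; with few samples per bin the estimate is noisy and biased.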

Theorem

Brier Score Calibration-Refinement Decomposition

Statement

Let $\bar{p}(x) = P(Y = 1 \mid p_\theta(x) = p(x))$ denote the conditional probability of $Y$ given the forecast value. The expected Brier score admits the DeGroot--Fienberg decomposition:

$$\mathbb{E}[(p - Y)^2] = \underbrace{\mathbb{E}[(p - \bar{p})^2]}_{\text{miscalibration}} + \underbrace{\mathbb{E}[\bar{p}(1 - \bar{p})]}_{\text{refinement (irreducible)}}$$

A forecast is calibrated when the first term is zero ($p = \bar{p}$ a.s.) and sharp when it concentrates $\bar{p}$ near 0 or 1.

Intuition

Predictive quality decomposes into two components. Calibration asks: when the model says 80%, is it right 80% of the time? Refinement asks how much signal the forecast carries about the outcome.

Why It Matters

This decomposition clarifies what temperature scaling and other post-hoc calibration methods do: they improve calibration without changing refinement. It also explains why a well-calibrated model is not necessarily a good model; you need both components.
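The decomposition can be checked pointwise: for a fixed forecast $p$ with $Y \sim \text{Bernoulli}(\bar{p})$, expanding $\mathbb{E}[(p-Y)^2]$ gives exactly $(p-\bar{p})^2 + \bar{p}(1-\bar{p})$. A small numerical check (toy values, no model involved):

```python
def brier_terms(p, p_bar):
    """For a forecast p whose true conditional outcome rate is p_bar,
    return the expected Brier score and its two decomposition terms."""
    # E[(p - Y)^2] with Y ~ Bernoulli(p_bar), written out exactly.
    expected_brier = p_bar * (1 - p) ** 2 + (1 - p_bar) * p ** 2
    miscalibration = (p - p_bar) ** 2
    refinement = p_bar * (1 - p_bar)
    return expected_brier, miscalibration, refinement

# Overconfident forecaster: says 0.9 when the true rate is 0.7.
b, m, r = brier_terms(0.9, 0.7)
print(round(b, 4), round(m, 4), round(r, 4))  # 0.25 0.04 0.21
```

Note that the refinement term 0.21 is irreducible here: even a perfectly calibrated forecast of 0.7 would still pay it.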

Failure Mode

This clean additive split is specific to the Brier score. The log score (and other Bregman-divergence scoring rules) admits an analogous but different decomposition: see Gneiting and Raftery (2007), who develop a general divergence-based framework. There is no universal additive calibration-refinement split that holds for every proper scoring rule.

Are LLMs calibrated? Kadavath et al. (2022) show that pre-RLHF base models are moderately well-calibrated on multiple-choice tasks where there is a clear answer set. The picture changes after instruction tuning and RLHF: the GPT-4 technical report (OpenAI 2023) and Tian et al. (2023) document that RLHF substantially degrades calibration, producing systematically overconfident probabilities. On open-ended generation the notion of "confidence" is itself underdefined, so standard calibration diagnostics do not transfer. The softmax distribution over the vocabulary is not a reliable indicator of factual confidence in either regime.

Conformal Prediction for Uncertainty

Theorem

Conformal Prediction Coverage Guarantee

Statement

Given a calibration set of $n$ exchangeable examples and a non-conformity score function $s$, the conformal prediction set $\mathcal{C}(x_{n+1})$ at level $1-\alpha$ satisfies:

$$P(y_{n+1} \in \mathcal{C}(x_{n+1})) \geq 1 - \alpha$$

This guarantee is distribution-free: it holds regardless of the underlying data distribution, the model, or the score function.

Intuition

Conformal prediction provides a prediction set rather than a point prediction. The set is constructed so that the true answer is included with high probability. When the model is uncertain, the set is large (many possible answers). When the model is confident and correct, the set is small.

Why It Matters

For LLMs, conformal prediction offers a principled way to quantify uncertainty without trusting the model's own probability estimates. If a model's prediction set for a factual question contains many contradictory answers, we know the model is unreliable on that query, even if the top prediction looks confident.

Failure Mode

The guarantee is marginal (averaged over all inputs), not conditional. For any specific input, the coverage could be higher or lower than $1-\alpha$. Also, applying conformal prediction to free-form text generation is an active research area: defining a good non-conformity score for open-ended text is hard.
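For the classification case, the split-conformal procedure is a few lines. A sketch using the common score $s(x, y) = 1 - p_\theta(y \mid x)$ and the $\lceil (n+1)(1-\alpha) \rceil$-th order statistic as the threshold (toy scores and answer probabilities, not from a real model):

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split conformal: the ceil((n+1)(1-alpha))-th smallest
    calibration non-conformity score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(answer_probs, qhat):
    """Keep every candidate whose score 1 - p(y|x) is within threshold."""
    return {y for y, p in answer_probs.items() if 1 - p <= qhat}

# Scores 1 - p(true answer) on ten held-out calibration questions.
cal_scores = [0.1, 0.3, 0.2, 0.5, 0.15, 0.4, 0.25, 0.35, 0.05, 0.45]
qhat = conformal_threshold(cal_scores, alpha=0.2)   # here 0.45

# Hypothetical answer distribution for a new question.
probs = {"Paris": 0.55, "Lyon": 0.30, "Marseille": 0.15}
print(prediction_set(probs, qhat))  # {'Paris'}
```

A less confident distribution would produce a larger set; the set size, not the raw softmax value, is the uncertainty signal.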

Retrieval-Augmented Generation

RAG mitigates hallucination by grounding generation in retrieved documents. Instead of relying on parametric memory (weights), the model conditions on relevant text retrieved from an external knowledge base. The retrieval component builds on information-retrieval foundations: inverted indexes, BM25 scoring, and reranking.

Formally, instead of generating from $p_\theta(y|x)$, we generate from:

$$p_\theta(y|x, D) \quad \text{where } D = \text{retrieve}(x, \mathcal{K})$$

and $\mathcal{K}$ is the knowledge base.

Why RAG helps: The model can copy or paraphrase from retrieved documents rather than generating from uncertain parametric knowledge. When the retriever finds the right document, hallucination drops significantly.

Why RAG is not enough: (1) The retriever can fail to find relevant documents. (2) The model can ignore retrieved context and generate from parametric memory anyway. (3) The retrieved documents themselves may contain errors. (4) The model can misinterpret or misquote the retrieved text.
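The retrieve-then-condition pipeline can be sketched end to end. This is a deliberately toy version: word-overlap scoring stands in for BM25 or dense retrieval, and the prompt string stands in for the conditioned generation $p_\theta(y \mid x, D)$; all names and documents here are illustrative.

```python
import re

def tokens(text):
    """Lowercase word set (a crude stand-in for real tokenization)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, knowledge_base, k=2):
    """Toy retriever: rank documents by word overlap with the query.
    A production system would use an inverted index with BM25 scoring,
    or dense embeddings followed by a reranker."""
    q = tokens(query)
    return sorted(knowledge_base,
                  key=lambda d: len(q & tokens(d)),
                  reverse=True)[:k]

def build_prompt(query, docs):
    """Assemble the context the model is conditioned on: p(y | x, D)."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

kb = [
    "Paris is the capital of France.",
    "The Brier score is a proper scoring rule.",
    "Conformal prediction gives marginal coverage guarantees.",
]
docs = retrieve("What is the capital of France?", kb)
print(build_prompt("What is the capital of France?", docs))
```

Failure modes (1) and (4) above live entirely inside `retrieve` and the generator's use of the prompt: if the overlap scoring misses the right document, grounding fails before generation even starts.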

Why RLHF Reduces but Does Not Eliminate Hallucination

RLHF (see the RLHF and alignment topic) trains models to produce outputs that human raters prefer. Since human raters tend to penalize obvious factual errors, RLHF reduces the rate of blatant hallucination.

But RLHF has fundamental limitations for truthfulness:

  1. Human raters cannot verify all claims. A model that produces a plausible-sounding but false statement about a technical topic may fool the rater.
  2. RLHF rewards confidence. Hedged, uncertain responses tend to be rated lower than confident ones. This creates an incentive toward overconfidence.
  3. The reward model generalizes imperfectly. Reward hacking means the model may find outputs that score highly on the reward model without being truthful.

Common Confusions

Watch Out

Hallucination is not the same as lying

Lying requires intent to deceive. LLMs have no intent. They are executing a learned next-token distribution. Hallucination is better understood as confabulation: the model fills in gaps in its knowledge with plausible patterns, the way a human with brain damage might confabulate memories. The model does not know it is wrong.

Watch Out

More data does not solve hallucination

Scaling the training set reduces some forms of hallucination (the model is less likely to be wrong about frequently-discussed topics) but cannot eliminate it. The training objective still has no truthfulness term. A model trained on all of the internet still hallucinates because the internet contains contradictions, errors, and gaps.

Watch Out

Low perplexity does not mean factual correctness

A model can achieve very low perplexity on a test set while making factual errors. Perplexity measures how well the model predicts the next token in a reference text. It does not measure whether the model can generate true statements when prompted.
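Perplexity is just the exponential of the average per-token negative log-likelihood, which makes the limitation concrete: it is a function of the probabilities assigned to a reference text, with no access to truth. A minimal sketch with toy probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability the model
    assigned to each token of a reference text."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A false but statistically typical sentence can receive the same token
# probabilities as a true one; perplexity cannot tell them apart.
print(round(perplexity([0.5, 0.5, 0.5]), 4))  # 2.0
```

A uniform 0.5 per token gives perplexity 2: the model is, on average, as uncertain as a fair coin flip per token, true or not.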

Summary

  • LLMs hallucinate because the training objective (next-token prediction) does not optimize for truth
  • Types: factual fabrication, citation fabrication, logical inconsistency, entity confusion
  • Calibration measures whether model confidence correlates with correctness
  • Post-RLHF LLMs are systematically overconfident, and on open-ended generation "confidence" itself is underdefined
  • Conformal prediction provides distribution-free coverage guarantees but the marginal/conditional gap is important
  • RAG helps by grounding generation in retrieved text but is not a complete solution
  • RLHF reduces obvious hallucination but incentivizes overconfidence

Exercises

ExerciseCore

Problem

Explain why a model that achieves zero cross-entropy loss on a training set containing both "Paris is the capital of France" and "London is the capital of France" would reproduce both statements with high confidence.

ExerciseAdvanced

Problem

A model outputs confidence scores for 1000 factual questions. You bin the predictions into 10 equal-width bins by confidence. In the bin with average confidence 0.85, the model gets 60% of questions correct. What is the contribution of this bin to the ECE? If this pattern holds across bins, is the model overconfident or underconfident?

ExerciseResearch

Problem

Design a non-conformity score function for applying conformal prediction to a question-answering LLM. What challenges arise compared to classification?

References

Canonical:

  • DeGroot, Fienberg, "The Comparison and Evaluation of Forecasters" (1983), The Statistician. Brier score calibration-refinement decomposition.
  • Gneiting, Raftery, "Strictly Proper Scoring Rules, Prediction, and Estimation" (2007), JASA. General theory of proper scoring rules.
  • Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (2005). Conformal prediction.

Current:

  • Ji et al., "Survey of Hallucination in Natural Language Generation" (2023)
  • Angelopoulos, Bates, "Conformal Prediction: A Gentle Introduction" (2023)
  • Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020)
  • Guo et al., "On Calibration of Modern Neural Networks" (ICML 2017). ECE estimator.
  • Kadavath et al., "Language Models (Mostly) Know What They Know" (2022). Base-model MCQ calibration.
  • Tian et al., "Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from LMs Fine-Tuned with Human Feedback" (EMNLP 2023). Post-RLHF degradation.

Next Topics

The natural next steps from hallucination theory:

  • RLHF and alignment: why RLHF reduces but does not eliminate hallucination
  • Mechanistic interpretability: can we find the circuits responsible for confabulation?

Last reviewed: April 2026
