Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

LLM Construction

Hallucination Theory

Why large language models confabulate, the mathematical frameworks for understanding when model outputs are unreliable, and what current theory says about mitigation.

Advanced · Tier 1 · Current · ~55 min

Why This Matters

Language models generate fluent, confident text that is sometimes factually wrong. They cite papers that do not exist. They invent historical events. They produce plausible-looking mathematical proofs with subtle errors. This is not a bug that will be fixed with more data or bigger models. It is a structural consequence of the training objective.

Understanding why hallucination occurs, and what mathematical tools exist to detect and mitigate it, is essential for deploying AI systems responsibly.

Mental Model

An LLM is trained to predict the next token given the previous tokens. This objective rewards producing text that looks like the training distribution, not text that is true. A model that perfectly minimizes cross-entropy loss on internet text will reproduce the statistical patterns of internet text, including its errors, inconsistencies, and fabrications.

Truthfulness is not a training signal. The model has no internal mechanism to verify facts against reality. It is a very good pattern matcher that sometimes matches patterns that do not correspond to truth.

The Root Cause: Objective Mismatch

Definition

Next-Token Prediction Objective

Given a sequence of tokens $x_1, \ldots, x_{t-1}$, the model minimizes:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t | x_1, \ldots, x_{t-1})$$

This is the cross-entropy loss against the empirical distribution of text.

The critical observation: this objective has no term for factual correctness. A model can achieve low loss by assigning high probability to fluent, grammatical, topically coherent text, even if the content is false. The training data contains both true and false statements, and the model learns to reproduce both kinds.
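The objective can be computed directly from per-token probabilities. A minimal sketch in pure Python with toy numbers (no real model assumed):

```python
import math

def next_token_loss(token_probs):
    """Cross-entropy loss: negative sum of the log-probabilities the
    model assigned to each token that actually occurred."""
    return -sum(math.log(p) for p in token_probs)

# Probabilities a hypothetical model assigned to the observed tokens.
# The loss rewards matching the text, regardless of whether it is true:
# a fluent false sentence with high token probabilities scores just as well.
loss = next_token_loss([0.9, 0.5, 0.8])
print(round(loss, 4))  # 1.0217
```

Nothing in this computation inspects the content of the tokens, which is exactly the objective-mismatch point.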

Types of Hallucination

Factual fabrication. The model states something false with high confidence. Example: generating a plausible-sounding biography with incorrect dates, or claiming a theorem holds when it does not.

Citation fabrication. The model produces references that do not exist. The titles and authors look plausible because they match the statistical patterns of real citations, but the specific combination is invented.

Logical inconsistency. The model produces a chain of reasoning where individual steps look correct but the overall argument is invalid. This is particularly dangerous in mathematical or legal contexts.

Entity confusion. The model conflates attributes of different entities because they appear in similar contexts in the training data.

Calibration Theory

If we cannot eliminate hallucination, can we at least know when the model is likely to be wrong?

Definition

Calibration

A model is calibrated if, among all predictions it makes with confidence $p$, a fraction $p$ of them are correct:

$$P(Y = y \mid p_\theta(Y = y | X) = p) = p \quad \text{for all } p \in [0,1]$$

Calibration means the model's uncertainty estimates are meaningful.

Definition

Expected Calibration Error

The expected calibration error measures the average gap between confidence and accuracy under the predictive distribution of the model's top-class confidence $C(X) = \max_y p_\theta(y \mid X)$:

$$\text{ECE} = \mathbb{E}_{X}\bigl[\,|\, P(Y = \hat{Y}(X) \mid C(X)) - C(X) \,|\,\bigr]$$

where $\hat{Y}(X)$ is the model's predicted label. In practice (Guo et al. 2017) this is estimated by the binned empirical version: partition the confidence interval into $M$ equal-width bins $B_1, \ldots, B_M$ and compute

$$\widehat{\text{ECE}} = \sum_{m=1}^{M} \frac{|B_m|}{n} \bigl|\,\text{acc}(B_m) - \text{conf}(B_m)\,\bigr|.$$

The binned estimator is what nearly every paper reports; keep the distinction between the population quantity and the estimator in mind.
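The binned estimator is short enough to write out in full. A sketch in pure Python, using equal-width bins over $[0,1]$ (edge-assignment conventions vary slightly between papers):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE (Guo et al. 2017): sum over bins of
    (bin size / n) * |accuracy(bin) - mean confidence(bin)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Equal-width bins; confidence 1.0 goes in the last bin.
        m = min(int(c * n_bins), n_bins - 1)
        bins[m].append((c, y))
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(acc - conf)
    return ece

# Ten predictions, all at confidence 0.85, but only 60% correct:
# the gap |0.60 - 0.85| = 0.25 is the entire (overconfident) ECE.
print(round(expected_calibration_error([0.85] * 10, [1] * 6 + [0] * 4), 4))  # 0.25
```

This is the estimator $\widehat{\text{ECE}}$, not the population ECE; with few samples per bin the estimate is noisy and biased.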

Theorem

Brier Score Calibration-Refinement Decomposition

Statement

Let $\bar{p}(x) = P(Y = 1 \mid p_\theta(x) = p(x))$ denote the conditional probability of $Y$ given the forecast value. The expected Brier score admits the DeGroot--Fienberg decomposition:

$$\mathbb{E}[(p - Y)^2] = \underbrace{\mathbb{E}[(p - \bar{p})^2]}_{\text{miscalibration}} + \underbrace{\mathbb{E}[\bar{p}(1 - \bar{p})]}_{\text{refinement (irreducible)}}$$

A forecast is calibrated when the first term is zero ($p = \bar{p}$ a.s.) and sharp when it concentrates $\bar{p}$ near 0 or 1.

Intuition

Predictive quality decomposes into two components. Calibration asks: when the model says 80%, is it right 80% of the time? Refinement asks how much signal the forecast carries about the outcome.

Why It Matters

This decomposition clarifies what temperature scaling and other post-hoc calibration methods do: they improve calibration without changing refinement. It also explains why a well-calibrated model is not necessarily a good model; you need both components.
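The decomposition can be checked pointwise: for a fixed forecast $p$ with $Y \sim \text{Bernoulli}(\bar{p})$, expanding $\mathbb{E}[(p-Y)^2]$ gives exactly $(p-\bar{p})^2 + \bar{p}(1-\bar{p})$. A small numerical check (toy values, no model involved):

```python
def brier_terms(p, p_bar):
    """For a forecast p whose true conditional outcome rate is p_bar,
    return the expected Brier score and its two decomposition terms."""
    # E[(p - Y)^2] with Y ~ Bernoulli(p_bar), written out exactly.
    expected_brier = p_bar * (1 - p) ** 2 + (1 - p_bar) * p ** 2
    miscalibration = (p - p_bar) ** 2
    refinement = p_bar * (1 - p_bar)
    return expected_brier, miscalibration, refinement

# Overconfident forecaster: says 0.9 when the true rate is 0.7.
b, m, r = brier_terms(0.9, 0.7)
print(round(b, 4), round(m, 4), round(r, 4))  # 0.25 0.04 0.21
```

Note that the refinement term 0.21 is irreducible here: even a perfectly calibrated forecast of 0.7 would still pay it.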

Failure Mode

This clean additive split is specific to the Brier score. The log score (and other Bregman-divergence scoring rules) admits an analogous but different decomposition: see Gneiting and Raftery (2007), who develop a general divergence-based framework. There is no universal additive calibration-refinement split that holds for every proper scoring rule.

Are LLMs calibrated? Kadavath et al. (2022) show that pre-RLHF base models are moderately well-calibrated on multiple-choice tasks where there is a clear answer set. The picture changes after instruction tuning and RLHF: the GPT-4 technical report (OpenAI 2023) and Tian et al. (2023) document that RLHF substantially degrades calibration, producing systematically overconfident probabilities. On open-ended generation the notion of "confidence" is itself underdefined, so standard calibration diagnostics do not transfer. The softmax distribution over the vocabulary is not a reliable indicator of factual confidence in either regime.

Conformal Prediction for Uncertainty

Theorem

Conformal Prediction Coverage Guarantee

Statement

Given a calibration set of $n$ exchangeable examples and a non-conformity score function $s$, the conformal prediction set $\mathcal{C}(x_{n+1})$ at level $1-\alpha$ satisfies:

$$P(y_{n+1} \in \mathcal{C}(x_{n+1})) \geq 1 - \alpha$$

This guarantee is distribution-free: it holds regardless of the underlying data distribution, the model, or the score function.

Intuition

Conformal prediction provides a prediction set rather than a point prediction. The set is constructed so that the true answer is included with high probability. When the model is uncertain, the set is large (many possible answers). When the model is confident and correct, the set is small.

Why It Matters

For LLMs, conformal prediction offers a principled way to quantify uncertainty without trusting the model's own probability estimates. If a model's prediction set for a factual question contains many contradictory answers, we know the model is unreliable on that query, even if the top prediction looks confident.

Failure Mode

The guarantee is marginal (averaged over all inputs), not conditional. For any specific input, the coverage could be higher or lower than $1-\alpha$. Also, applying conformal prediction to free-form text generation is an active research area: defining a good non-conformity score for open-ended text is hard.
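For the classification case, the split-conformal procedure is a few lines. A sketch using the common score $s(x, y) = 1 - p_\theta(y \mid x)$ and the $\lceil (n+1)(1-\alpha) \rceil$-th order statistic as the threshold (toy scores and answer probabilities, not from a real model):

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split conformal: the ceil((n+1)(1-alpha))-th smallest
    calibration non-conformity score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(answer_probs, qhat):
    """Keep every candidate whose score 1 - p(y|x) is within threshold."""
    return {y for y, p in answer_probs.items() if 1 - p <= qhat}

# Scores 1 - p(true answer) on ten held-out calibration questions.
cal_scores = [0.1, 0.3, 0.2, 0.5, 0.15, 0.4, 0.25, 0.35, 0.05, 0.45]
qhat = conformal_threshold(cal_scores, alpha=0.2)   # here 0.45

# Hypothetical answer distribution for a new question.
probs = {"Paris": 0.55, "Lyon": 0.30, "Marseille": 0.15}
print(prediction_set(probs, qhat))  # {'Paris'}
```

A less confident distribution would produce a larger set; the set size, not the raw softmax value, is the uncertainty signal.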

Retrieval-Augmented Generation

RAG mitigates hallucination by grounding generation in retrieved documents. Instead of relying on parametric memory (weights), the model conditions on relevant text retrieved from an external knowledge base. The retrieval component builds on information-retrieval foundations: inverted indexes, BM25 scoring, and reranking.

Formally, instead of generating from $p_\theta(y|x)$, we generate from:

$$p_\theta(y|x, D) \quad \text{where } D = \text{retrieve}(x, \mathcal{K})$$

and $\mathcal{K}$ is the knowledge base.

Why RAG helps: The model can copy or paraphrase from retrieved documents rather than generating from uncertain parametric knowledge. When the retriever finds the right document, hallucination drops significantly.

Why RAG is not enough: (1) The retriever can fail to find relevant documents. (2) The model can ignore retrieved context and generate from parametric memory anyway. (3) The retrieved documents themselves may contain errors. (4) The model can misinterpret or misquote the retrieved text.
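The retrieve-then-condition pipeline can be sketched end to end. This is a deliberately toy version: word-overlap scoring stands in for BM25 or dense retrieval, and the prompt string stands in for the conditioned generation $p_\theta(y \mid x, D)$; all names and documents here are illustrative.

```python
import re

def tokens(text):
    """Lowercase word set (a crude stand-in for real tokenization)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, knowledge_base, k=2):
    """Toy retriever: rank documents by word overlap with the query.
    A production system would use an inverted index with BM25 scoring,
    or dense embeddings followed by a reranker."""
    q = tokens(query)
    return sorted(knowledge_base,
                  key=lambda d: len(q & tokens(d)),
                  reverse=True)[:k]

def build_prompt(query, docs):
    """Assemble the context the model is conditioned on: p(y | x, D)."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

kb = [
    "Paris is the capital of France.",
    "The Brier score is a proper scoring rule.",
    "Conformal prediction gives marginal coverage guarantees.",
]
docs = retrieve("What is the capital of France?", kb)
print(build_prompt("What is the capital of France?", docs))
```

Failure modes (1) and (4) above live entirely inside `retrieve` and the generator's use of the prompt: if the overlap scoring misses the right document, grounding fails before generation even starts.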

Why RLHF Reduces but Does Not Eliminate Hallucination

RLHF (see the RLHF and alignment topic) trains models to produce outputs that human raters prefer. Since human raters tend to penalize obvious factual errors, RLHF reduces the rate of blatant hallucination.

But RLHF has fundamental limitations for truthfulness:

  1. Human raters cannot verify all claims. A model that produces a plausible-sounding but false statement about a technical topic may fool the rater.
  2. RLHF rewards confidence. Hedged, uncertain responses tend to be rated lower than confident ones. This creates an incentive toward overconfidence.
  3. The reward model generalizes imperfectly. Reward hacking means the model may find outputs that score highly on the reward model without being truthful.

Common Confusions

Watch Out

Hallucination is not the same as lying

Lying requires intent to deceive. LLMs have no intent. They are executing a learned next-token distribution. Hallucination is better understood as confabulation: the model fills in gaps in its knowledge with plausible patterns, the way a human with brain damage might confabulate memories. The model does not know it is wrong.

Watch Out

More data does not solve hallucination

Scaling the training set reduces some forms of hallucination (the model is less likely to be wrong about frequently-discussed topics) but cannot eliminate it. The training objective still has no truthfulness term. A model trained on all of the internet still hallucinates because the internet contains contradictions, errors, and gaps.

Watch Out

Low perplexity does not mean factual correctness

A model can achieve very low perplexity on a test set while making factual errors. Perplexity measures how well the model predicts the next token in a reference text. It does not measure whether the model can generate true statements when prompted.
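Perplexity is just the exponential of the average per-token negative log-likelihood, which makes the limitation concrete: it is a function of the probabilities assigned to a reference text, with no access to truth. A minimal sketch with toy probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability the model
    assigned to each token of a reference text."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A false but statistically typical sentence can receive the same token
# probabilities as a true one; perplexity cannot tell them apart.
print(round(perplexity([0.5, 0.5, 0.5]), 4))  # 2.0
```

A uniform 0.5 per token gives perplexity 2: the model is, on average, as uncertain as a fair coin flip per token, true or not.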

Summary

  • LLMs hallucinate because the training objective (next-token prediction) does not optimize for truth
  • Types: factual fabrication, citation fabrication, logical inconsistency, entity confusion
  • Calibration measures whether model confidence correlates with correctness
  • Post-RLHF LLMs are systematically overconfident, and on open-ended generation "confidence" itself is underdefined
  • Conformal prediction provides distribution-free coverage guarantees but the marginal/conditional gap is important
  • RAG helps by grounding generation in retrieved text but is not a complete solution
  • RLHF reduces obvious hallucination but incentivizes overconfidence

Exercises

ExerciseCore

Problem

Explain why a model that achieves zero cross-entropy loss on a training set containing both "Paris is the capital of France" and "London is the capital of France" would reproduce both statements with high confidence.

ExerciseAdvanced

Problem

A model outputs confidence scores for 1000 factual questions. You bin the predictions into 10 equal-width bins by confidence. In the bin with average confidence 0.85, the model gets 60% of questions correct. What is the contribution of this bin to the ECE? If this pattern holds across bins, is the model overconfident or underconfident?

ExerciseResearch

Problem

Design a non-conformity score function for applying conformal prediction to a question-answering LLM. What challenges arise compared to classification?

References

Canonical:

  • DeGroot, Fienberg, "The Comparison and Evaluation of Forecasters" (1983), The Statistician. Brier score calibration-refinement decomposition.
  • Gneiting, Raftery, "Strictly Proper Scoring Rules, Prediction, and Estimation" (2007), JASA. General theory of proper scoring rules.
  • Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (2005). Conformal prediction.

Current:

  • Ji et al., "Survey of Hallucination in Natural Language Generation" (2023)
  • Angelopoulos, Bates, "Conformal Prediction: A Gentle Introduction" (2023)
  • Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020)
  • Guo et al., "On Calibration of Modern Neural Networks" (ICML 2017). ECE estimator.
  • Kadavath et al., "Language Models (Mostly) Know What They Know" (2022). Base-model MCQ calibration.
  • Tian et al., "Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from LMs Fine-Tuned with Human Feedback" (EMNLP 2023). Post-RLHF degradation.

Next Topics

The natural next steps from hallucination theory:

  • RLHF and alignment: why RLHF reduces but does not eliminate hallucination
  • Mechanistic interpretability: can we find the circuits responsible for confabulation?

Last reviewed: April 2026
