AI Safety
Truth Directions and Linear Probes
The geometry-of-truth program asks whether transformer residual streams linearly separate true from false factual statements. Linear probes can reveal such structure; activation interventions test whether the direction is merely correlational or actually behavior-changing.
Why This Matters
If we eventually want to build a serious artifact around the Geometry of Truth, the conceptual stack needs to be clear first.
The claim is not that language models contain a magical "truth neuron." The claim is more modest and more interesting: at sufficient scale, a transformer's residual stream may organize true and false factual statements in a way that is linearly readable. If that is true, then:
- a linear probe can classify true versus false statements from internal activations,
- a steering vector can move the model toward one side of that geometry,
- and causal interventions can test whether the linearly recovered direction is something the model is actually using.
That matters for interpretability, alignment, and future labs. It is the bridge between pure representation analysis and intervention-based control.
A truth direction is a separating feature in residual-stream space
The workflow is: collect true and false factual statements, read a residual stream layer, fit a linear separator, and then test whether moving activations along that direction actually changes the model's next-token beliefs.
Figure: probe space. Green points are true statements, rose points are false statements, and the dashed diagonal is the learned separating hyperplane.
Figure: causal intervention. The gold step adds a scaled truth vector at one layer; what matters is whether the model's own output beliefs flip with it.
Probe first
A linear probe learns a normal vector $\theta$ so that the score $\theta^\top x + b$ separates residual activations of true statements from those of false statements.
Intervene second
A high-accuracy probe is not enough. The causal question is whether adding or subtracting that direction changes the model's own continuation probabilities.
The modern caveat
Recent follow-up work shows that not every model has a stable truth direction across tasks. Stronger models tend to make the linear structure cleaner; weaker ones can be inconsistent.
Mental Model
Take a batch of statements like:
- "Paris is the capital of France."
- "The Pacific Ocean is smaller than the Atlantic Ocean."
Run them through a transformer and record the residual stream at a chosen layer, usually on the final token position. Each statement becomes a vector in high-dimensional activation space.
Now ask a geometric question: are the vectors for true statements and false statements separated by a hyperplane?
If yes, then a normal vector to that hyperplane is a truth direction in the limited, operational sense used by this literature. It does not mean the model has solved philosophy. It means truth-relevant information has become linearly accessible in that activation space for that dataset and that layer.
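A minimal sketch of this reading step, assuming a HuggingFace causal LM; the model name `gpt2`, the layer index, and the tiny statement list are illustrative stand-ins, not the setup used in the original papers.

```python
# Minimal sketch: extract final-token hidden states for labeled statements.
# Assumes a HuggingFace causal LM; model name and layer index are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the original work probes larger models
layer = 6            # illustrative choice of residual-stream layer

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

statements = [
    ("Paris is the capital of France.", 1),
    ("The Pacific Ocean is smaller than the Atlantic Ocean.", 0),
]

activations, labels = [], []
with torch.no_grad():
    for text, label in statements:
        inputs = tokenizer(text, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer] has shape (batch, seq, d_model); take the final token
        activations.append(out.hidden_states[layer][0, -1])
        labels.append(label)

X = torch.stack(activations)  # (n_statements, d_model)
y = torch.tensor(labels)      # 1 = true, 0 = false
```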
Formal Setup
Linear probe
A linear probe is a classifier of the form
$$f(x) = \operatorname{sign}(\theta^\top x + b),$$
where $x$ is an activation vector from some layer and token position, $\theta$ is the probe direction, and $b$ is a bias term. For probabilistic probes, one often uses logistic regression:
$$p(\text{true} \mid x) = \sigma(\theta^\top x + b).$$
The important object is still the direction $\theta$.
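Given activations and labels like those collected above, fitting such a probe is a few lines. The sketch below assumes scikit-learn's logistic regression and the `X`, `y` tensors from the earlier snippet.

```python
# Minimal probe-fitting sketch, assuming X (n, d_model) activations and y (n,) labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_np = X.numpy()
y_np = y.numpy()

probe = LogisticRegression(max_iter=1000)
probe.fit(X_np, y_np)

theta = probe.coef_[0]                      # the probe direction
theta_unit = theta / np.linalg.norm(theta)  # unit-norm candidate truth direction
print("train accuracy:", probe.score(X_np, y_np))
```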
Truth direction
Within a given dataset and model layer, a truth direction is a direction $\theta_{\text{truth}}$ such that the signed projection $\theta_{\text{truth}}^\top x$ separates or substantially discriminates true from false statements.
This is an empirical notion, not a universal semantic primitive. A truth direction can be stable, unstable, task-specific, or misleading depending on the dataset and the model.
Activation steering
Activation steering modifies a hidden state during inference. If a layer emits activation $h$, a steering intervention replaces it with
$$h' = h + \alpha\, d$$
for some steering direction $d$ and scale $\alpha$. The causal question is then whether the model's own next-token distribution changes in the intended way.
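A minimal steering sketch under the same assumptions as the earlier snippets (a GPT-2-style HuggingFace model and the fitted probe direction); the module path `model.transformer.h[layer]`, the scale `alpha`, and the prompt are illustrative choices, not a prescribed recipe.

```python
# Minimal steering sketch: add alpha * direction to one layer's residual output.
# Assumes the GPT-2-style model, tokenizer, layer, and theta_unit defined above.
import torch

alpha = 4.0
direction = torch.as_tensor(theta_unit, dtype=torch.float32)  # candidate truth direction

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + alpha * direction  # h' = h + alpha * d, applied at every position
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[layer].register_forward_hook(steering_hook)
with torch.no_grad():
    steered_logits = model(**tokenizer("The Atlantic Ocean is", return_tensors="pt")).logits
handle.remove()
```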
Main Propositions
At the final residual stream, steering changes logits linearly
Statement
At the final residual stream before unembedding, if logits are
$$\ell(h) = W_U h,$$
then the steering intervention
$$h' = h + \alpha\, d$$
changes the logits exactly by
$$\ell(h') - \ell(h) = \alpha\, W_U d.$$
So at that location, steering along $d$ is literally a linear logit update.
Intuition
At the final layer there is no mystery left about what happens next: the unembedding matrix reads the residual stream linearly. Adding a direction before unembedding directly shifts the logits in the corresponding vocabulary directions.
Proof Sketch
Substitute $h' = h + \alpha d$ into the unembedding:
$$W_U(h + \alpha d) = W_U h + \alpha\, W_U d.$$
No approximation is needed at the final readout layer.
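A toy numerical check of this identity; the dimensions and random matrices below are arbitrary stand-ins for the unembedding matrix and the final residual-stream activation.

```python
# Numerical check of the final-layer identity on toy dimensions:
# W_U @ (h + alpha * d) == W_U @ h + alpha * (W_U @ d)
import torch

torch.manual_seed(0)
d_model, vocab = 16, 50
W_U = torch.randn(vocab, d_model)  # toy unembedding matrix
h = torch.randn(d_model)           # toy final residual-stream activation
d = torch.randn(d_model)           # toy steering direction
alpha = 3.0

lhs = W_U @ (h + alpha * d)
rhs = W_U @ h + alpha * (W_U @ d)
assert torch.allclose(lhs, rhs, atol=1e-5)
```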
Why It Matters
This explains why representation engineering can work at all. If a probe recovers a useful direction late in the residual stream, then adding that direction has a mechanically simple path to changing the model's output distribution.
A perfect probe can still track the wrong feature
Statement
Suppose training activations have the form
$$x = (x_1, x_2, \ldots) \quad \text{with } x_1 = y \text{ (the truth label)},$$
and with the nuisance coordinate satisfying $x_2 = x_1$ on the training set. Then both probe directions
$$\theta_1 = e_1 \qquad \text{and} \qquad \theta_2 = e_2$$
classify the training data perfectly, even though only $\theta_1$ is aligned with the actual target coordinate $x_1$. On a transfer set where $x_1$ and $x_2$ decouple, $\theta_2$ can fail arbitrarily badly.
Therefore perfect probe accuracy on one dataset does not by itself identify a causal or semantically correct direction.
Intuition
If a nuisance feature rides along with the real feature during training, a probe can latch onto the nuisance and still look perfect. This is why transfer tests and interventions matter.
Proof Sketch
On the training set, $x_2 = x_1 = y$, so the decision rules $x \mapsto \theta_1^\top x$ and $x \mapsto \theta_2^\top x$ both reduce to reading off $y$. On any transfer set where $x_2$ differs from $x_1$, the second rule follows the nuisance coordinate instead of the true label.
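A toy construction makes the failure concrete: the nuisance coordinate matches the label on the training set and decouples on the transfer set, so both hard-coded probes are perfect on train while the spurious one drops to chance on transfer.

```python
# Toy demonstration: two probes that agree on training data but diverge under transfer.
# Coordinate 0 carries the true label; coordinate 1 is a nuisance feature.
import numpy as np

rng = np.random.default_rng(0)
n = 200
y_train = rng.integers(0, 2, n)
X_train = np.stack([y_train, y_train], axis=1).astype(float)   # nuisance == label

y_transfer = rng.integers(0, 2, n)
nuisance = rng.integers(0, 2, n)                                # now independent
X_transfer = np.stack([y_transfer, nuisance], axis=1).astype(float)

w_true = np.array([1.0, 0.0])      # reads the real coordinate
w_spurious = np.array([0.0, 1.0])  # reads the nuisance coordinate

def acc(w, X, y):
    return ((X @ w > 0.5).astype(int) == y).mean()

print(acc(w_true, X_train, y_train), acc(w_spurious, X_train, y_train))              # 1.0, 1.0
print(acc(w_true, X_transfer, y_transfer), acc(w_spurious, X_transfer, y_transfer))  # 1.0, ~0.5
```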
Why It Matters
This is the core warning for truth-direction research. A probe can be impressive, visually clean, and still not represent what we think it represents. Causal interventions and cross-task generalization are not optional extras; they are the credibility check.
What The Geometry-of-Truth Result Actually Says
Marks and Tegmark's result is best read as an emergent linear-structure claim. On carefully curated true/false factual datasets, sufficiently capable models show a residual-stream geometry in which truth and falsehood are surprisingly linearly separable.
That is already interesting. It suggests:
- truth-relevant structure is present before the final answer token is emitted,
- the structure is not purely nonlinear noise,
- and simple probes may capture something real about internal belief state.
But the result is narrower than many retellings imply. It does not by itself prove:
- that there is one universal truth axis across all tasks,
- that steering this axis will always improve factuality,
- that the probe is causally faithful,
- or that every model exposes the same clean linear geometry.
Recent follow-up work is exactly about this gap: transfer, consistency, and task dependence.
Probe First, Then Intervene
The clean workflow is:
- collect paired true/false factual statements,
- extract residual activations from a chosen layer and token position,
- fit a linear probe,
- test out-of-distribution generalization,
- intervene along the candidate direction,
- check whether next-token predictions or truth judgments actually move.
Those last two steps, intervening and then checking the model's own outputs, are what turn a pretty probe plot into a mechanistic claim; a minimal version of the check is sketched below.
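A minimal sketch of that causal check, reusing the hypothetical `model`, `tokenizer`, `layer`, and `steering_hook` from the earlier snippets; the prompt and the ` True` / ` False` answer tokens are illustrative.

```python
# Minimal causal check sketch: does steering move the model's next-token distribution?
import torch
import torch.nn.functional as F

prompt = "The Pacific Ocean is smaller than the Atlantic Ocean. True or False? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

def next_token_probs():
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return F.softmax(logits, dim=-1)

p_base = next_token_probs()
handle = model.transformer.h[layer].register_forward_hook(steering_hook)
p_steered = next_token_probs()
handle.remove()

# Compare the probability mass on candidate answer tokens before and after steering.
for word in [" True", " False"]:
    tok = tokenizer(word, add_special_tokens=False).input_ids[0]
    print(word, float(p_base[tok]), float(p_steered[tok]))
```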
Why This Matters For Future Labs
If we later build a flagship Geometry of Truth / linear probe lab, the page stack should already make the epistemic rules clear:
- which model and layer are being probed,
- what the positive and negative datasets are,
- what the transfer split is,
- what scale parameter is used for intervention,
- and whether the intervention flips only a benchmark label or genuinely shifts next-token probabilities in a coherent way.
That is the difference between "we drew a separating line" and "we built a real representation-engineering experiment."
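One hypothetical way to keep those rules explicit is to record them as structured experiment metadata; every field name and value below is an illustrative assumption, not a fixed schema.

```python
# Hypothetical record of the experiment metadata the checklist above asks for.
from dataclasses import dataclass

@dataclass
class TruthProbeExperiment:
    model_name: str        # which model is probed
    layer: int             # which residual-stream layer
    positive_dataset: str  # source of true statements
    negative_dataset: str  # source of false statements
    transfer_split: str    # held-out task or dataset for generalization
    steering_alpha: float  # intervention scale
    causal_metric: str     # e.g. shift in next-token probabilities, not just a label flip

config = TruthProbeExperiment(
    model_name="gpt2", layer=6,
    positive_dataset="curated_true_facts", negative_dataset="curated_false_facts",
    transfer_split="question_answering", steering_alpha=4.0,
    causal_metric="delta_next_token_prob",
)
```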
Common Confusions
A truth direction is not proof of inner honesty
Linear separability of true and false statements means information is accessible in the residual stream. It does not prove the model has a single stable inner notion of truth across tasks, prompts, and instructions.
Probe accuracy and causal importance are different questions
A probe asks whether a direction carries information. An intervention asks whether changing that direction changes behavior. The second question is the harder and more trustworthy one.
Truth directions are layer- and dataset-dependent
Different layers can expose different levels of abstraction. A direction that works on declarative factual statements may not transfer cleanly to question answering, logical transformations, or instruction-following data.
Exercises
Problem
Why is a linear probe naturally described by a direction rather than by its predicted labels alone?
Problem
Suppose a final-layer truth-direction intervention increases the logit of a truthful token but also pushes several unrelated tokens upward. Why is that not surprising?
Problem
What evidence would you require before claiming that a learned truth direction generalizes from declarative factual statements to real question answering?
References
- Samuel Marks and Max Tegmark, The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets, arXiv 2023 / COLM 2024. Foundational truth-direction paper.
- Thomas Burger, Armin Hahne, and Siegfried Handschuh, Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks, Findings of ACL 2025. Best follow-up on transfer and generalization limits.
- Andy Zou et al., Representation Engineering: A Top-Down Approach to AI Transparency, arXiv 2023. High-level framing for population-level representation steering.
- Nina Rimsky et al., Steering Llama 2 via Contrastive Activation Addition, ACL 2024. Practical intervention paper for residual-stream steering.
- Alexander Turner et al., Steering Language Models With Activation Engineering, arXiv 2023. Early activation-addition reference connecting probe-like directions to controllable behavior.
Next Topics
If this page asks whether truth is linearly represented and steerable, the next questions are:
- Sparse Autoencoders for discovering richer feature dictionaries than a single probe direction,
- Mechanistic Interpretability for the broader causal toolkit,
- and Theorem Proving in Lean if we want to contrast latent "belief geometry" with explicit formal verification.
Last reviewed: April 25, 2026
Prerequisites
Foundations this topic depends on.
- Mechanistic Interpretability: Features, Circuits, and Causal Faithfulness (Layer 4)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rⁿ (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Continuity in Rⁿ (Layer 0A)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Principal Component Analysis (Layer 1)
- Singular Value Decomposition (Layer 0A)
- Sparse Autoencoders for Interpretability: TopK, JumpReLU, Matryoshka, and Scaling (Layer 4)
- Autoencoders (Layer 2)
- Lasso Regression (Layer 2)
- Linear Regression (Layer 1)
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Central Limit Theorem (Layer 0B)
- Law of Large Numbers (Layer 0B)
- Random Variables (Layer 0A)
- Kolmogorov Probability Axioms (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- KL Divergence (Layer 1)
- Information Theory Foundations (Layer 0B)
- Residual Stream and Transformer Internals (Layer 4)