AI Safety
Truth Directions and Linear Probes
The geometry-of-truth program asks whether transformer residual streams linearly separate true from false factual statements. Linear probes can reveal such structure; activation interventions test whether the direction is merely correlational or actually behavior-changing.
Why This Matters
If we eventually want to build a serious artifact around the Geometry of Truth, the conceptual stack needs to be clear first.
The claim is not that language models contain a magical "truth neuron." The claim is more modest and more interesting: at sufficient scale, a transformer's residual stream may organize true and false factual statements in a way that is linearly readable. If that is true, then:
- a linear probe can classify true versus false statements from internal activations,
- a steering vector can move the model toward one side of that geometry,
- and causal interventions can test whether the linearly recovered direction is something the model is actually using.
That matters for interpretability, alignment, and future labs. It is the bridge between pure representation analysis and intervention-based control.
A truth direction is a separating feature in residual-stream space
The workflow is: collect true and false factual statements, read a residual stream layer, fit a linear separator, and then test whether moving activations along that direction actually changes the model's next-token beliefs.
Figure: probe space. Green points are true statements, rose points are false statements, and the dashed diagonal is the learned separating hyperplane.
Figure: causal intervention. The gold step adds a scaled truth vector at one layer; what matters is whether the model's own output beliefs flip with it.
Probe first
A linear probe learns a normal vector $\theta$ so that the score $\theta^\top x + b$ separates residual activations of true statements from those of false statements.
Intervene second
A high-accuracy probe is not enough. The causal question is whether adding or subtracting that direction changes the model's own continuation probabilities.
The modern caveat
Recent follow-up work shows that not every model has a stable truth direction across tasks. Stronger models tend to make the linear structure cleaner; weaker ones can be inconsistent.
Mental Model
Take a batch of statements like:
- "Paris is the capital of France."
- "The Pacific Ocean is smaller than the Atlantic Ocean."
Run them through a transformer and record the residual stream at a chosen layer, usually on the final token position. Each statement becomes a vector in high-dimensional activation space.
Now ask a geometric question: are the vectors for true statements and false statements separated by a hyperplane?
If yes, then a normal vector to that hyperplane is a truth direction in the limited, operational sense used by this literature. It does not mean the model has solved philosophy. It means truth-relevant information has become linearly accessible in that activation space for that dataset and that layer.
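A minimal sketch of this reading step, assuming a HuggingFace causal LM; the model name `gpt2`, the layer index, and the tiny statement list are illustrative stand-ins, not the setup used in the original papers.

```python
# Minimal sketch: extract final-token hidden states for labeled statements.
# Assumes a HuggingFace causal LM; model name and layer index are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the original work probes larger models
layer = 6            # illustrative choice of residual-stream layer

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

statements = [
    ("Paris is the capital of France.", 1),
    ("The Pacific Ocean is smaller than the Atlantic Ocean.", 0),
]

activations, labels = [], []
with torch.no_grad():
    for text, label in statements:
        inputs = tokenizer(text, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer] has shape (batch, seq, d_model); take the final token
        activations.append(out.hidden_states[layer][0, -1])
        labels.append(label)

X = torch.stack(activations)  # (n_statements, d_model)
y = torch.tensor(labels)      # 1 = true, 0 = false
```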
Formal Setup
Linear probe
A linear probe is a classifier of the form
$$f(x) = \operatorname{sign}(\theta^\top x + b),$$
where $x$ is an activation vector from some layer and token position, $\theta$ is the probe direction, and $b$ is a bias term. For probabilistic probes, one often uses logistic regression:
$$p(\text{true} \mid x) = \sigma(\theta^\top x + b).$$
The important object is still the direction $\theta$.
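Given activations and labels like those collected above, fitting such a probe is a few lines. The sketch below assumes scikit-learn's logistic regression and the `X`, `y` tensors from the earlier snippet.

```python
# Minimal probe-fitting sketch, assuming X (n, d_model) activations and y (n,) labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_np = X.numpy()
y_np = y.numpy()

probe = LogisticRegression(max_iter=1000)
probe.fit(X_np, y_np)

theta = probe.coef_[0]                      # the probe direction
theta_unit = theta / np.linalg.norm(theta)  # unit-norm candidate truth direction
print("train accuracy:", probe.score(X_np, y_np))
```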
Truth direction
Within a given dataset and model layer, a truth direction is a direction $\theta_{\text{truth}}$ such that the signed projection $\theta_{\text{truth}}^\top x$ separates or substantially discriminates true from false statements.
This is an empirical notion, not a universal semantic primitive. A truth direction can be stable, unstable, task-specific, or misleading depending on the dataset and the model.
Activation steering
Activation steering modifies a hidden state during inference. If a layer emits activation $h$, a steering intervention replaces it with
$$h' = h + \alpha\, d$$
for some steering direction $d$ and scale $\alpha$. The causal question is then whether the model's own next-token distribution changes in the intended way.
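A minimal steering sketch under the same assumptions as the earlier snippets (a GPT-2-style HuggingFace model and the fitted probe direction); the module path `model.transformer.h[layer]`, the scale `alpha`, and the prompt are illustrative choices, not a prescribed recipe.

```python
# Minimal steering sketch: add alpha * direction to one layer's residual output.
# Assumes the GPT-2-style model, tokenizer, layer, and theta_unit defined above.
import torch

alpha = 4.0
direction = torch.as_tensor(theta_unit, dtype=torch.float32)  # candidate truth direction

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + alpha * direction  # h' = h + alpha * d, applied at every position
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[layer].register_forward_hook(steering_hook)
with torch.no_grad():
    steered_logits = model(**tokenizer("The Atlantic Ocean is", return_tensors="pt")).logits
handle.remove()
```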
Main Propositions
At the final residual stream, steering changes logits linearly
Statement
At the final residual stream before unembedding, if logits are
$$\ell(h) = W_U h,$$
then the steering intervention
$$h' = h + \alpha\, d$$
changes the logits exactly by
$$\ell(h') - \ell(h) = \alpha\, W_U d.$$
So at that location, steering along $d$ is literally a linear logit update.
Intuition
At the final layer there is no mystery left about what happens next: the unembedding matrix reads the residual stream linearly. Adding a direction before unembedding directly shifts the logits in the corresponding vocabulary directions.
Proof Sketch
Substitute $h' = h + \alpha d$ into the unembedding:
$$W_U(h + \alpha d) = W_U h + \alpha\, W_U d.$$
No approximation is needed at the final readout layer.
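A toy numerical check of this identity; the dimensions and random matrices below are arbitrary stand-ins for the unembedding matrix and the final residual-stream activation.

```python
# Numerical check of the final-layer identity on toy dimensions:
# W_U @ (h + alpha * d) == W_U @ h + alpha * (W_U @ d)
import torch

torch.manual_seed(0)
d_model, vocab = 16, 50
W_U = torch.randn(vocab, d_model)  # toy unembedding matrix
h = torch.randn(d_model)           # toy final residual-stream activation
d = torch.randn(d_model)           # toy steering direction
alpha = 3.0

lhs = W_U @ (h + alpha * d)
rhs = W_U @ h + alpha * (W_U @ d)
assert torch.allclose(lhs, rhs, atol=1e-5)
```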
Why It Matters
This explains why representation engineering can work at all. If a probe recovers a useful direction late in the residual stream, then adding that direction has a mechanically simple path to changing the model's output distribution.
A perfect probe can still track the wrong feature
Statement
Suppose training activations have the form
$$x = (x_1, x_2, \ldots) \quad \text{with } x_1 = y \text{ (the truth label)},$$
and with the nuisance coordinate satisfying $x_2 = x_1$ on the training set. Then both probe directions
$$\theta_1 = e_1 \qquad \text{and} \qquad \theta_2 = e_2$$
classify the training data perfectly, even though only $\theta_1$ is aligned with the actual target coordinate $x_1$. On a transfer set where $x_1$ and $x_2$ decouple, $\theta_2$ can fail arbitrarily badly.
Therefore perfect probe accuracy on one dataset does not by itself identify a causal or semantically correct direction.
Intuition
If a nuisance feature rides along with the real feature during training, a probe can latch onto the nuisance and still look perfect. This is why transfer tests and interventions matter.
Proof Sketch
On the training set, $x_2 = x_1 = y$, so the decision rules $x \mapsto \theta_1^\top x$ and $x \mapsto \theta_2^\top x$ both reduce to reading off $y$. On any transfer set where $x_2$ differs from $x_1$, the second rule follows the nuisance coordinate instead of the true label.
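A toy construction makes the failure concrete: the nuisance coordinate matches the label on the training set and decouples on the transfer set, so both hard-coded probes are perfect on train while the spurious one drops to chance on transfer.

```python
# Toy demonstration: two probes that agree on training data but diverge under transfer.
# Coordinate 0 carries the true label; coordinate 1 is a nuisance feature.
import numpy as np

rng = np.random.default_rng(0)
n = 200
y_train = rng.integers(0, 2, n)
X_train = np.stack([y_train, y_train], axis=1).astype(float)   # nuisance == label

y_transfer = rng.integers(0, 2, n)
nuisance = rng.integers(0, 2, n)                                # now independent
X_transfer = np.stack([y_transfer, nuisance], axis=1).astype(float)

w_true = np.array([1.0, 0.0])      # reads the real coordinate
w_spurious = np.array([0.0, 1.0])  # reads the nuisance coordinate

def acc(w, X, y):
    return ((X @ w > 0.5).astype(int) == y).mean()

print(acc(w_true, X_train, y_train), acc(w_spurious, X_train, y_train))              # 1.0, 1.0
print(acc(w_true, X_transfer, y_transfer), acc(w_spurious, X_transfer, y_transfer))  # 1.0, ~0.5
```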
Why It Matters
This is the core warning for truth-direction research. A probe can be impressive, visually clean, and still not represent what we think it represents. Causal interventions and cross-task generalization are not optional extras; they are the credibility check.
What The Geometry-of-Truth Result Actually Says
Marks and Tegmark's result is best read as an emergent linear-structure claim. On carefully curated true/false factual datasets, sufficiently capable models show a residual-stream geometry in which truth and falsehood are surprisingly linearly separable.
That is already interesting. It suggests:
- truth-relevant structure is present before the final answer token is emitted,
- the structure is not purely nonlinear noise,
- and simple probes may capture something real about internal belief state.
But the result is narrower than many retellings imply. It does not by itself prove:
- that there is one universal truth axis across all tasks,
- that steering this axis will always improve factuality,
- that the probe is causally faithful,
- or that every model exposes the same clean linear geometry.
Recent follow-up work is exactly about this gap: transfer, consistency, and task dependence.
Probe First, Then Intervene
The clean workflow is:
- collect paired true/false factual statements,
- extract residual activations from a chosen layer and token position,
- fit a linear probe,
- test out-of-distribution generalization,
- intervene along the candidate direction,
- check whether next-token predictions or truth judgments actually move.
Those last two steps, intervening and then checking the model's own outputs, are what turn a pretty probe plot into a mechanistic claim; a minimal version of the check is sketched below.
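A minimal sketch of that causal check, reusing the hypothetical `model`, `tokenizer`, `layer`, and `steering_hook` from the earlier snippets; the prompt and the ` True` / ` False` answer tokens are illustrative.

```python
# Minimal causal check sketch: does steering move the model's next-token distribution?
import torch
import torch.nn.functional as F

prompt = "The Pacific Ocean is smaller than the Atlantic Ocean. True or False? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

def next_token_probs():
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return F.softmax(logits, dim=-1)

p_base = next_token_probs()
handle = model.transformer.h[layer].register_forward_hook(steering_hook)
p_steered = next_token_probs()
handle.remove()

# Compare the probability mass on candidate answer tokens before and after steering.
for word in [" True", " False"]:
    tok = tokenizer(word, add_special_tokens=False).input_ids[0]
    print(word, float(p_base[tok]), float(p_steered[tok]))
```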
Why This Matters For Future Labs
If we later build a flagship Geometry of Truth / linear probe lab, the page stack should already make the epistemic rules clear:
- which model and layer are being probed,
- what the positive and negative datasets are,
- what the transfer split is,
- what scale parameter is used for intervention,
- and whether the intervention flips only a benchmark label or genuinely shifts next-token probabilities in a coherent way.
That is the difference between "we drew a separating line" and "we built a real representation-engineering experiment."
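One hypothetical way to keep those rules explicit is to record them as structured experiment metadata; every field name and value below is an illustrative assumption, not a fixed schema.

```python
# Hypothetical record of the experiment metadata the checklist above asks for.
from dataclasses import dataclass

@dataclass
class TruthProbeExperiment:
    model_name: str        # which model is probed
    layer: int             # which residual-stream layer
    positive_dataset: str  # source of true statements
    negative_dataset: str  # source of false statements
    transfer_split: str    # held-out task or dataset for generalization
    steering_alpha: float  # intervention scale
    causal_metric: str     # e.g. shift in next-token probabilities, not just a label flip

config = TruthProbeExperiment(
    model_name="gpt2", layer=6,
    positive_dataset="curated_true_facts", negative_dataset="curated_false_facts",
    transfer_split="question_answering", steering_alpha=4.0,
    causal_metric="delta_next_token_prob",
)
```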
Common Confusions
A truth direction is not proof of inner honesty
Linear separability of true and false statements means information is accessible in the residual stream. It does not prove the model has a single stable inner notion of truth across tasks, prompts, and instructions.
Probe accuracy and causal importance are different questions
A probe asks whether a direction carries information. An intervention asks whether changing that direction changes behavior. The second question is the harder and more trustworthy one.
Truth directions are layer- and dataset-dependent
Different layers can expose different levels of abstraction. A direction that works on declarative factual statements may not transfer cleanly to question answering, logical transformations, or instruction-following data.
Exercises
Problem
Why is a linear probe naturally described by a direction rather than by its predicted labels alone?
Problem
Suppose a final-layer truth-direction intervention increases the logit of a truthful token but also pushes several unrelated tokens upward. Why is that not surprising?
Problem
What evidence would you require before claiming that a learned truth direction generalizes from declarative factual statements to real question answering?
References
- Samuel Marks and Max Tegmark, The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets, arXiv 2023 / COLM 2024. Foundational truth-direction paper.
- Thomas Burger, Armin Hahne, and Siegfried Handschuh, Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks, Findings of ACL 2025. Best follow-up on transfer and generalization limits.
- Andy Zou et al., Representation Engineering: A Top-Down Approach to AI Transparency, arXiv 2023. High-level framing for population-level representation steering.
- Nina Rimsky et al., Steering Llama 2 via Contrastive Activation Addition, ACL 2024. Practical intervention paper for residual-stream steering.
- Alexander Turner et al., Steering Language Models With Activation Engineering, arXiv 2023. Early activation-addition reference connecting probe-like directions to controllable behavior.
Next Topics
If this page asks whether truth is linearly represented and steerable, the next questions are:
- Sparse Autoencoders for discovering richer feature dictionaries than a single probe direction,
- Mechanistic Interpretability for the broader causal toolkit,
- and Theorem Proving in Lean if we want to contrast latent "belief geometry" with explicit formal verification.
Last reviewed: April 25, 2026
Prerequisites
Foundations this topic depends on.
- Mechanistic Interpretability: Features, Circuits, and Causal Faithfulness (Layer 4)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rⁿ (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Continuity in Rⁿ (Layer 0A)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Principal Component Analysis (Layer 1)
- Singular Value Decomposition (Layer 0A)
- Sparse Autoencoders for Interpretability: TopK, JumpReLU, Matryoshka, and Scaling (Layer 4)
- Autoencoders (Layer 2)
- Lasso Regression (Layer 2)
- Linear Regression (Layer 1)
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Central Limit Theorem (Layer 0B)
- Law of Large Numbers (Layer 0B)
- Random Variables (Layer 0A)
- Kolmogorov Probability Axioms (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- KL Divergence (Layer 1)
- Information Theory Foundations (Layer 0B)
- Residual Stream and Transformer Internals (Layer 4)