
Truth Directions and Linear Probes

The geometry-of-truth program asks whether transformer residual streams linearly separate true from false factual statements. Linear probes can reveal such structure; activation interventions test whether the direction is merely correlational or actually behavior-changing.

Research · Tier 2 · Frontier · ~70 min

Why This Matters

If we want a serious future artifact around the Geometry of Truth, we need the conceptual stack clear first.

The claim is not that language models contain a magical "truth neuron." The claim is more modest and more interesting: at sufficient scale, a transformer's residual stream may organize true and false factual statements in a way that is linearly readable. If that is true, then:

  • a linear probe can classify true versus false statements from internal activations,
  • a steering vector can move the model toward one side of that geometry,
  • and causal interventions can test whether the linearly recovered direction is something the model is actually using.

That matters for interpretability, alignment, and future labs. It is the bridge between pure representation analysis and intervention-based control.

A truth direction is a separating feature in residual-stream space

The workflow is: collect true and false factual statements, read out activations from a chosen residual-stream layer, fit a linear separator, and then test whether moving activations along that direction actually changes the model's next-token beliefs.

[Figure: two linked panels. Left, "probe space": green points are true statements, rose points are false statements, and the dashed diagonal is the learned separating hyperplane. Right, "causal intervention": the gold step adds a scaled truth vector to a false statement's activation ($\text{false} + \alpha v$) at one layer; what matters is whether the model's own output beliefs flip with it.]

Probe first

A linear probe learns a normal vector $w$ such that the projection $w^\top h + b$ separates residual activations from true and false statements.

Intervene second

A high-accuracy probe is not enough. The causal question is whether adding or subtracting that direction changes the model's own continuation probabilities.

The modern caveat

Recent follow-up work shows that not every model has a stable truth direction across tasks: more capable models tend to show cleaner linear structure, while weaker ones can be inconsistent.

Mental Model

Take a batch of statements like:

  • "Paris is the capital of France."
  • "The Pacific Ocean is smaller than the Atlantic Ocean."

Run them through a transformer and record the residual stream at a chosen layer, usually on the final token position. Each statement becomes a vector in high-dimensional activation space.

Now ask a geometric question: are the vectors for true statements and false statements separated by a hyperplane?

If yes, then a normal vector $v$ to that hyperplane is a truth direction in the limited, operational sense used by this literature. It does not mean the model has solved philosophy. It means truth-relevant information has become linearly accessible in that activation space for that dataset and that layer.
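A minimal sketch of the extraction step, assuming the Hugging Face `transformers` library; the model name `gpt2`, the layer index, and the two example statements are illustrative placeholders, not the exact setup used in the original papers:

```python
# Sketch: collect final-token hidden states for labeled statements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

statements = [
    ("Paris is the capital of France.", 1),
    ("The Pacific Ocean is smaller than the Atlantic Ocean.", 0),
]
layer = 8  # which residual-stream layer to read (placeholder choice)

activations, labels = [], []
with torch.no_grad():
    for text, label in statements:
        tokens = tokenizer(text, return_tensors="pt")
        out = model(**tokens)
        # hidden_states[layer] has shape (batch, seq_len, d_model);
        # take the final token position as the statement summary.
        h = out.hidden_states[layer][0, -1, :]
        activations.append(h)
        labels.append(label)

H = torch.stack(activations)   # (n_statements, d_model)
y = torch.tensor(labels)       # (n_statements,)
```

In practice the statement set is much larger and drawn from curated true/false datasets; the point here is only the shape of the pipeline.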

Formal Setup

Definition

Linear probe

A linear probe is a classifier of the form

$$\hat{y} = \mathrm{sign}(w^\top h + b),$$

where $h \in \mathbb{R}^d$ is an activation vector from some layer and token position, $w \in \mathbb{R}^d$ is the probe direction, and $b$ is a bias term. For probabilistic probes, one often uses logistic regression:

$$\mathbb{P}(y = 1 \mid h) = \sigma(w^\top h + b).$$

The important object is still the direction $w$.
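A minimal probe fit on top of the activations from the earlier sketch, assuming scikit-learn and the `H`, `y` arrays defined above; a real experiment would use many more statements plus a separate transfer split:

```python
# Sketch: fit a logistic-regression probe on residual activations.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = H.numpy()          # (n_statements, d_model); assumes many statements, not just two
labels = y.numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

w = probe.coef_[0]       # the probe direction in residual-stream space
b = probe.intercept_[0]
print("held-out accuracy:", probe.score(X_test, y_test))
```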

Definition

Truth direction

Within a given dataset and model layer, a truth direction is a direction vv such that the signed projection h,v\langle h, v \rangle separates or substantially discriminates true from false statements.

This is an empirical notion, not a universal semantic primitive. A truth direction can be stable, unstable, task-specific, or misleading depending on the dataset and the model.

Definition

Activation steering

Activation steering modifies a hidden state during inference. If a layer emits activation $h$, a steering intervention replaces it with

$$h' = h + \alpha v$$

for some steering direction $v$ and scale $\alpha$. The causal question is then whether the model's own next-token distribution changes in the intended way.
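A sketch of that intervention using a PyTorch forward hook, assuming the model and tokenizer from the earlier extraction sketch; the layer index, the direction `v` (random here as a stand-in for a learned truth direction), and the scale `alpha` are illustrative choices, and the module path `model.transformer.h[8]` is GPT-2-specific:

```python
# Sketch: add a scaled steering vector to one layer's output during inference.
import torch

alpha = 4.0
v = torch.randn(model.config.hidden_size)   # stand-in for a learned truth direction
v = v / v.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden state.
    if isinstance(output, tuple):
        hidden = output[0] + alpha * v.to(output[0].dtype)
        return (hidden,) + output[1:]
    return output + alpha * v.to(output.dtype)

layer_module = model.transformer.h[8]        # naming varies by model family
handle = layer_module.register_forward_hook(steering_hook)
try:
    prompt = "The Pacific Ocean is smaller than the Atlantic Ocean. That statement is"
    tokens = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        steered_logits = model(**tokens).logits[0, -1]
finally:
    handle.remove()                          # always restore the unmodified model
```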

Main Propositions

Proposition

At the final residual stream, steering changes logits linearly

Statement

At the final residual stream before unembedding, if logits are

$$z = W_U h + b,$$

then the steering intervention

$$h' = h + \alpha v$$

changes the logits exactly by

$$z' = z + \alpha W_U v.$$

So at that location, steering along $v$ is literally a linear logit update.

Intuition

At the final layer there is no mystery left about what happens next: the unembedding matrix reads the residual stream linearly. Adding a direction before unembedding directly shifts the logits in the corresponding vocabulary directions.

Proof Sketch

Substitute $h'$ into the unembedding:

$$z' = W_U(h + \alpha v) + b = W_U h + b + \alpha W_U v = z + \alpha W_U v.$$

No approximation is needed at the final readout layer.

Why It Matters

This explains why representation engineering can work at all. If a probe recovers a useful direction late in the residual stream, then adding that direction has a mechanically simple path to changing the model's output distribution.
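The identity is easy to check numerically with random stand-ins for the unembedding; all shapes and values below are arbitrary:

```python
# Sketch: verify z' = z + alpha * (W_U @ v) with random stand-ins.
import torch

d_model, vocab = 16, 50
W_U = torch.randn(vocab, d_model)   # stand-in unembedding matrix
b = torch.randn(vocab)
h = torch.randn(d_model)
v = torch.randn(d_model)
alpha = 2.5

z = W_U @ h + b
z_steered = W_U @ (h + alpha * v) + b
assert torch.allclose(z_steered, z + alpha * (W_U @ v), atol=1e-5)
```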

Proposition

A perfect probe can still track the wrong feature

Statement

Suppose training activations have the form

$$h = (y, c)$$

with $c = y$ on the training set. Then both probe directions

$$w_1 = (1, 0) \quad \text{and} \quad w_2 = (0, 1)$$

classify the training data perfectly, even though only $w_1$ is aligned with the actual target coordinate $y$. On a transfer set where $c$ and $y$ decouple, $w_2$ can fail arbitrarily badly.

Therefore perfect probe accuracy on one dataset does not by itself identify a causal or semantically correct direction.

Intuition

If a nuisance feature rides along with the real feature during training, a probe can latch onto the nuisance and still look perfect. This is why transfer tests and interventions matter.

Proof Sketch

On the training set, $c = y$, so the decision rules $\mathrm{sign}(w_1^\top h)$ and $\mathrm{sign}(w_2^\top h)$ both reduce to $\mathrm{sign}(y)$. On any transfer set where $c$ differs from $y$, the second rule follows the nuisance coordinate instead of the true label.

Why It Matters

This is the core warning for truth-direction research. A probe can be impressive, visually clean, and still not represent what we think it represents. Causal interventions and cross-task generalization are not optional extras; they are the credibility check.
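A toy numerical version of this failure mode, with a label coordinate and a nuisance coordinate that are identical at training time but decoupled at transfer time; all data here is synthetic:

```python
# Sketch: two "perfect" training-set probes, only one of which transfers.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Training set: nuisance coordinate c equals the label y exactly.
y_train = rng.choice([-1, 1], size=n)
h_train = np.stack([y_train, y_train], axis=1).astype(float)   # h = (y, c), c = y

w1 = np.array([1.0, 0.0])   # reads the label coordinate
w2 = np.array([0.0, 1.0])   # reads the nuisance coordinate
print("train acc w1:", np.mean(np.sign(h_train @ w1) == y_train))   # 1.0
print("train acc w2:", np.mean(np.sign(h_train @ w2) == y_train))   # 1.0

# Transfer set: c is now independent of y.
y_test = rng.choice([-1, 1], size=n)
c_test = rng.choice([-1, 1], size=n)
h_test = np.stack([y_test, c_test], axis=1).astype(float)
print("transfer acc w1:", np.mean(np.sign(h_test @ w1) == y_test))  # 1.0
print("transfer acc w2:", np.mean(np.sign(h_test @ w2) == y_test))  # ~0.5
```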

What The Geometry-of-Truth Result Actually Says

Marks and Tegmark's result is best read as an emergent linear-structure claim. On carefully curated true/false factual datasets, sufficiently capable models show a residual-stream geometry in which truth and falsehood are surprisingly linearly separable.

That is already interesting. It suggests:

  • truth-relevant structure is present before the final answer token is emitted,
  • the structure is not purely nonlinear noise,
  • and simple probes may capture something real about internal belief state.

But the result is narrower than many retellings imply. It does not by itself prove:

  • that there is one universal truth axis across all tasks,
  • that steering this axis will always improve factuality,
  • that the probe is causally faithful,
  • or that every model exposes the same clean linear geometry.

Recent follow-up work is exactly about this gap: transfer, consistency, and task dependence.

Probe First, Then Intervene

The clean workflow is:

  1. collect paired true/false factual statements,
  2. extract residual activations from a chosen layer and token position,
  3. fit a linear probe,
  4. test out-of-distribution generalization,
  5. intervene along the candidate direction,
  6. check whether next-token predictions or truth judgments actually move.

That fifth step is what turns a pretty probe plot into a mechanistic claim.
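A sketch of how steps 5 and 6 might be measured together, reusing the `model`, `tokenizer`, `layer_module`, and `steering_hook` names from the earlier sketches; the prompt and the choice of " true" / " false" continuation tokens are placeholders:

```python
# Sketch: compare next-token log-probabilities with and without the intervention.
import torch

prompt = "The Pacific Ocean is smaller than the Atlantic Ocean. That statement is"
tokens = tokenizer(prompt, return_tensors="pt")
true_id = tokenizer(" true", add_special_tokens=False).input_ids[0]
false_id = tokenizer(" false", add_special_tokens=False).input_ids[0]

def next_token_logprobs(model, tokens):
    with torch.no_grad():
        logits = model(**tokens).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)

base = next_token_logprobs(model, tokens)

handle = layer_module.register_forward_hook(steering_hook)
try:
    steered = next_token_logprobs(model, tokens)
finally:
    handle.remove()

print("delta logprob ' true':", (steered[true_id] - base[true_id]).item())
print("delta logprob ' false':", (steered[false_id] - base[false_id]).item())
```

A coherent shift, not just a flipped benchmark label, is what the last list item above is asking for.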

Why This Matters For Future Labs

If we later build a flagship Geometry of Truth / linear probe lab, the page stack should already make the epistemic rules clear:

  • which model and layer are being probed,
  • what the positive and negative datasets are,
  • what the transfer split is,
  • what scale parameter $\alpha$ is used for the intervention,
  • and whether the intervention flips only a benchmark label or genuinely shifts next-token probabilities in a coherent way.

That is the difference between "we drew a separating line" and "we built a real representation-engineering experiment."

Common Confusions

Watch Out

A truth direction is not proof of inner honesty

Linear separability of true and false statements means information is accessible in the residual stream. It does not prove the model has a single stable inner notion of truth across tasks, prompts, and instructions.

Watch Out

Probe accuracy and causal importance are different questions

A probe asks whether a direction carries information. An intervention asks whether changing that direction changes behavior. The second question is the harder and more trustworthy one.

Watch Out

Truth directions are layer- and dataset-dependent

Different layers can expose different levels of abstraction. A direction that works on declarative factual statements may not transfer cleanly to question answering, logical transformations, or instruction-following data.

Exercises

ExerciseCore

Problem

Why is a linear probe naturally described by a direction $w$ rather than by its predicted labels alone?

ExerciseAdvanced

Problem

Suppose a final-layer truth-direction intervention increases the logit of a truthful token but also pushes several unrelated tokens upward. Why is that not surprising?

ExerciseResearch

Problem

What evidence would you require before claiming that a learned truth direction generalizes from declarative factual statements to real question answering?


Next Topics

If this page asks whether truth is linearly represented and steerable, the next questions are:

Last reviewed: April 25, 2026
