
LLM Construction

Latent Reasoning

Reasoning in hidden state space instead of generating chain-of-thought tokens: recurrent computation and continuous thought for scaling inference compute without scaling output length.


Why This Matters

Chain-of-thought (CoT) prompting improves LLM reasoning by generating intermediate tokens that externalize the computation. But this approach has a cost: every reasoning token consumes output bandwidth, increases latency, and is visible to the user. For a problem requiring 1000 reasoning steps, the model must generate 1000 tokens before producing an answer.

Latent reasoning asks: can we perform the same computation inside the model's hidden states, without generating visible tokens? Instead of producing a chain of tokens $t_1, t_2, \ldots, t_n$, the model iterates its hidden state $z \to f(z) \to f(f(z)) \to \cdots$ for multiple steps, then produces the answer directly from the final hidden state.

This is early-stage research. The motivation: decouple inference compute from output length.

Mental Model

Standard CoT is like working a math problem by writing out every step on paper. Latent reasoning is like solving it in your head, only writing down the final answer. The internal computation still happens, but it operates on continuous vectors rather than discrete tokens. The "scratchpad" is the hidden state, not the output sequence.

Core Concepts

Definition

Chain-of-Thought Bottleneck

In standard chain-of-thought reasoning, inference compute scales linearly with the number of generated tokens. Each token requires a full forward pass through the model. The CoT bottleneck is the constraint that reasoning depth is coupled to output sequence length: more computation requires more output tokens.

Definition

Latent Reasoning

Latent reasoning performs iterative computation in the model's hidden state space without generating intermediate tokens. Given hidden state $z_t$ at position $t$, the model applies a recurrent function $f$ for $M$ steps:

$$z_t^{(0)} = z_t, \qquad z_t^{(m)} = f(z_t^{(m-1)}) \quad \text{for } m = 1, \ldots, M$$

The final state $z_t^{(M)}$ is then used to predict the next token. The parameter $M$ controls reasoning depth independently of output length.
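The recurrence above can be sketched in a few lines. This is a toy illustration, not a real model: the update $f$ is stood in for by a hypothetical random linear map plus a tanh nonlinearity, where a real system would apply a full transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # hidden dimension (toy size; real models use thousands)
M = 8    # number of latent reasoning steps

# Hypothetical stand-in for the recurrent update f: a toy linear map + tanh.
W = rng.standard_normal((d, d)) / np.sqrt(d)

def f(z):
    return np.tanh(W @ z)

z = rng.standard_normal(d)   # z_t^(0): the hidden state at position t
for m in range(M):
    z = f(z)                 # z_t^(m) = f(z_t^(m-1)) -- no token emitted

# Only the final state z_t^(M) would be decoded into the next token.
print(z.shape)  # (16,)
```

Note that increasing `M` adds compute without producing any output tokens, which is exactly the decoupling the definition describes.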

Definition

Continuous Thought

Continuous thought replaces discrete token representations with continuous vector embeddings in the reasoning process. Instead of sampling a token $x_t$, embedding it, and feeding it back, the model directly passes the continuous hidden state to the next iteration. This eliminates the information bottleneck of discrete tokenization during reasoning.

Approaches

Recurrent Depth

The simplest form: apply the same transformer block (or a subset of blocks) multiple times. A standard transformer with $L$ layers computes:

$$z^{(l)} = \text{Block}_l(z^{(l-1)}) \quad \text{for } l = 1, \ldots, L$$

A recurrent-depth model with $L$ layers and $M$ recurrence steps computes:

$$z^{(l, m)} = \text{Block}_l(z^{(l-1, m)}) \quad \text{for } l = 1, \ldots, L$$

and then feeds $z^{(L, m)}$ back as the input for recurrence step $m+1$. The effective depth is $L \times M$ while the parameter count remains that of $L$ layers.
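A minimal sketch of this weight-reuse pattern, assuming each block is a toy residual map (random weights stand in for trained parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
d, L, M = 8, 3, 4  # hidden dim, layer count, recurrence steps (toy sizes)

# One weight matrix per layer -- the same L matrices are reused every pass.
blocks = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]

def block(l, z):
    # Stand-in for Block_l: residual connection + tanh nonlinearity.
    return z + np.tanh(blocks[l] @ z)

z = rng.standard_normal(d)
applications = 0
for m in range(M):          # outer loop: recurrence steps
    for l in range(L):      # inner loop: the same L blocks each time
        z = block(l, z)     # output of step m feeds step m+1
        applications += 1

print(applications)  # 12: effective depth L*M, parameters for only L layers
```

The key point is visible in the structure: `blocks` holds parameters for $L$ layers only, yet the forward pass runs $L \times M$ block applications.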

Coconut (Chain of Continuous Thought)

Coconut (Hao et al., 2024) replaces discrete chain-of-thought tokens with continuous hidden states. During training:

  1. Start with a standard CoT training example
  2. Replace each reasoning token with a special "thought" token
  3. Instead of embedding the thought token discretely, pass the continuous hidden state from the previous position directly
  4. The model learns to reason in this continuous space

At inference, the model generates a fixed number of continuous thought steps (no discrete tokens emitted), then generates the answer.
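The inference loop can be sketched as follows. Everything here is a hypothetical stand-in (a random embedding table, a one-matrix "trunk"); the point is only the control flow: the hidden state is fed back directly, with no sampling or embedding lookup between thought steps.

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab = 16, 50  # toy sizes

E = rng.standard_normal((vocab, d))            # toy embedding/unembedding table
W_body = rng.standard_normal((d, d)) / np.sqrt(d)

def body(x):
    # Stand-in for the transformer trunk mapping an input vector to a hidden state.
    return np.tanh(W_body @ x)

prompt_ids = [3, 17, 42]
h = body(E[prompt_ids].mean(axis=0))  # crude stand-in for encoding the prompt

K = 4  # fixed number of continuous thought steps
for _ in range(K):
    # Coconut-style step: the hidden state itself is the next input --
    # no argmax, no sampling, no discrete token in between.
    h = body(h)

answer_id = int(np.argmax(E @ h))  # decode only the final answer
print(answer_id)
```

Contrast with discrete CoT, where each iteration would instead compute `tok = argmax(E @ h)` and feed `E[tok]` back in, forcing the state through the vocabulary bottleneck at every step.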

Pause Tokens

A simpler variant: insert learnable "pause" tokens into the input. These tokens carry no semantic content but give the model additional forward-pass computation before it must produce an output. The model learns to use these extra positions for internal computation.
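The input-side mechanics are simple enough to show directly. This sketch assumes a toy embedding table and a single learnable pause embedding (both hypothetical stand-ins for trained parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
d, vocab, n_pause = 8, 20, 3

E = rng.standard_normal((vocab, d))   # toy token embedding table
pause = rng.standard_normal(d)        # one learnable <pause> embedding

prompt_ids = [4, 9, 1]
X = E[prompt_ids]                     # (3, d) prompt embeddings

# Append n_pause copies of the pause embedding: no semantic content, just
# extra sequence positions the model can use for computation before answering.
X_padded = np.vstack([X, np.tile(pause, (n_pause, 1))])
print(X_padded.shape)  # (6, 8)
```

Unlike recurrent depth, this adds compute by widening the sequence rather than by reusing layers, so the extra computation still pays the attention cost of the added positions.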

Main Theorems

Proposition

Recurrent Depth Increases Expressiveness

Statement

A transformer with $L$ layers and $M$ recurrence steps can represent functions that require depth $\Omega(L \cdot M)$ in a standard (non-recurrent) transformer. Specifically, for Boolean circuit families of depth $d$, a recurrent transformer with $L$ layers and $M = \lceil d/L \rceil$ recurrence steps can simulate any circuit in the family, while a non-recurrent transformer requires $d$ layers.

Intuition

Each recurrence step adds $L$ layers of effective depth without adding parameters. A recurrent transformer with $L = 12$ layers and $M = 10$ recurrence steps has the computational capacity of a 120-layer transformer but the parameter count of a 12-layer one. The trade-off is compute time: $M$ recurrence steps take $M$ times as long.

Why It Matters

This formalizes the intuition that latent reasoning trades parameters for compute. A smaller model with more recurrence steps can match the expressiveness of a much larger model with a single pass. This is the theoretical basis for scaling inference compute without scaling model size.

Failure Mode

Expressiveness does not guarantee learnability. A recurrent transformer can represent depth-$LM$ circuits, but gradient-based training may not find these solutions. Recurrent computation introduces vanishing/exploding gradient issues analogous to those in RNNs. Residual connections and careful initialization help but do not eliminate the problem. In practice, the number of useful recurrence steps is limited.

The Information Bottleneck Argument

Standard CoT forces all inter-step communication through discrete tokens. Each token is an element of a finite vocabulary $\mathcal{V}$ with $|\mathcal{V}|$ possibilities. The information capacity per step is at most $\log_2 |\mathcal{V}|$ bits (about 15 bits for a 32K vocabulary).

Continuous hidden states are vectors in $\mathbb{R}^d$ with $d$ typically 4096 or more. Even with finite-precision arithmetic, the information capacity per step is vastly larger. Latent reasoning removes the discrete bottleneck, allowing richer information flow between reasoning steps.
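The two capacities can be computed directly. This treats fp16 entries as 16 raw bits each, which is a naive upper bound (see the caveat below about why the comparison is misleading):

```python
import math

vocab = 32_000
bits_discrete = math.log2(vocab)      # capacity of one discrete CoT token

d = 4096
bits_per_entry = 16                   # fp16 entries, counted naively
bits_continuous = d * bits_per_entry  # naive upper bound for one hidden state

print(round(bits_discrete, 2))        # ~14.97 bits
print(bits_continuous)                # 65536 bits
```

Even this crude accounting shows a gap of several thousandfold per step, though the effective (usable) capacity of a trained hidden state is far below the raw bit count.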

This does not mean latent reasoning is strictly better. Discrete tokens provide interpretability (you can read the chain of thought) and serve as a form of regularization (the model must compress its reasoning into human-readable form). Latent reasoning sacrifices both.

Challenges

Training difficulty. Latent reasoning requires learning what to compute in hidden space without the supervision signal that discrete CoT tokens provide. In standard CoT training, each intermediate token has a ground-truth target. In latent reasoning, only the final answer provides supervision. The model must discover useful intermediate representations on its own.

Interpretability loss. With discrete CoT, you can inspect the reasoning trace. With latent reasoning, the computation is opaque. This makes debugging, auditing, and alignment substantially harder.

Optimal depth selection. How many recurrence steps $M$ should the model use? Too few and the model cannot solve hard problems. Too many and compute is wasted on easy problems. Adaptive depth (dynamically choosing $M$ per input) is an open problem.

Watch Out

Latent reasoning is not just deeper networks

Adding more layers to a standard transformer increases depth but also increases parameter count. Latent reasoning reuses the same parameters for multiple passes. The distinction matters: a 120-layer transformer has 10x the parameters of a 12-layer one, but a 12-layer recurrent transformer with 10 passes has the same parameters as the base 12-layer model.
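The accounting is worth making explicit. Normalizing to one parameter unit per layer (and assuming all layers are the same size):

```python
# Toy parameter accounting: 1 unit of parameters per layer, all layers equal.
layers_standard = 120        # deep non-recurrent transformer
layers_recurrent = 12        # recurrent base model
passes = 10                  # recurrence steps

params_standard = layers_standard       # 120 units of parameters
params_recurrent = layers_recurrent     # 12 units -- weights reused each pass
effective_depth = layers_recurrent * passes

print(params_standard // params_recurrent)  # 10: the parameter-count gap
print(effective_depth)                      # 120: same effective depth
```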

Watch Out

Continuous thought is not soft prompting

Soft prompting learns continuous embeddings as inputs to the model. Continuous thought passes continuous hidden states between reasoning steps. Soft prompts are fixed after training. Continuous thought states are dynamically computed at inference time based on the input.

Summary

  • Standard CoT couples reasoning depth to output length; latent reasoning decouples them
  • Recurrent depth: apply the same transformer blocks multiple times, trading compute for effective depth without adding parameters
  • Continuous thought (Coconut): replace discrete reasoning tokens with continuous hidden state propagation
  • Information capacity per reasoning step is much higher in continuous space than in discrete token space
  • Challenges: training is harder without intermediate supervision, interpretability is lost, and adaptive depth is unsolved
  • This is early-stage research with promising but preliminary results

Exercises

ExerciseCore

Problem

A standard transformer has $L = 24$ layers. A recurrent variant uses $L = 8$ layers with $M = 5$ recurrence steps. Compare their effective depth and parameter counts (assume each layer has the same number of parameters).

ExerciseAdvanced

Problem

A vocabulary of size $|\mathcal{V}| = 32000$ provides at most $\log_2(32000) \approx 15$ bits of information per discrete reasoning token. A continuous hidden state has dimension $d = 4096$ with 16-bit floating-point entries. What is the theoretical maximum information per continuous reasoning step? Why is this comparison misleading in practice?

References

Canonical:

  • Hao et al., "Training Large Language Models to Reason in a Continuous Latent Space" (Coconut, 2024)

Current:

  • Goyal et al., "Think before you speak: Training Language Models With Pause Tokens" (2024)
  • Dehghani et al., "Universal Transformers" (ICLR 2019)

Next Topics

The natural next steps from latent reasoning:

  • Multi-token prediction: another approach to planning ahead, but in token space rather than hidden state space
  • Memory systems for LLMs: long-term memory complements latent reasoning by providing persistent storage beyond the hidden state

Last reviewed: April 2026
