LLM Construction
Latent Reasoning
Reasoning in hidden state space instead of generating chain-of-thought tokens: recurrent computation and continuous thought for scaling inference compute without scaling output length.
Why This Matters
Chain-of-thought (CoT) prompting improves LLM reasoning by generating intermediate tokens that externalize the computation. But this approach has a cost: every reasoning token consumes output bandwidth, increases latency, and is visible to the user. For a problem requiring 1000 reasoning steps, the model must generate 1000 tokens before producing an answer.
Latent reasoning asks: can we perform the same computation inside the model's hidden states, without generating visible tokens? Instead of producing a chain of tokens, the model iterates its hidden state for multiple steps, then produces the answer directly from the final hidden state.
This is early-stage research. The motivation: decouple inference compute from output length.
Mental Model
Standard CoT is like working a math problem by writing out every step on paper. Latent reasoning is like solving it in your head, only writing down the final answer. The internal computation still happens, but it operates on continuous vectors rather than discrete tokens. The "scratchpad" is the hidden state, not the output sequence.
Core Concepts
Chain-of-Thought Bottleneck
In standard chain-of-thought reasoning, inference compute scales linearly with the number of generated tokens. Each token requires a full forward pass through the model. The CoT bottleneck is the constraint that reasoning depth is coupled to output sequence length: more computation requires more output tokens.
Latent Reasoning
Latent reasoning performs iterative computation in the model's hidden state space without generating intermediate tokens. Given hidden state $h^{(0)}$ at position $t$, the model applies a recurrent function $f$ for $k$ steps:

$$h^{(i+1)} = f(h^{(i)}), \quad i = 0, 1, \ldots, k-1$$

The final state $h^{(k)}$ is then used to predict the next token. The parameter $k$ controls reasoning depth independently of output length.
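The iteration above can be sketched in a few lines. This is a toy illustration, not any published architecture: `f`, the hidden dimension, and `k` are all placeholder choices.

```python
import numpy as np

# Toy sketch of latent iteration: the same function f is applied k times
# to a hidden state h; no tokens are emitted during the iteration.
# All names (f, latent_reason, k) are illustrative, not from any paper.

rng = np.random.default_rng(0)
d = 8                                  # hidden dimension (toy size)
W = rng.standard_normal((d, d)) / np.sqrt(d)

def f(h):
    """One latent reasoning step: a linear map plus tanh nonlinearity."""
    return np.tanh(W @ h)

def latent_reason(h0, k):
    """Iterate the hidden state k times without emitting any tokens."""
    h = h0
    for _ in range(k):
        h = f(h)
    return h

h0 = rng.standard_normal(d)
h_final = latent_reason(h0, k=10)      # reasoning depth k, output length 0
print(h_final.shape)                   # (8,)
```

The point is structural: reasoning depth is `k`, a loop counter, not a count of generated tokens.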
Continuous Thought
Continuous thought replaces discrete token representations with continuous vector embeddings in the reasoning process. Instead of sampling a token $y_t$, embedding it, and feeding the embedding back, the model directly passes the continuous hidden state $h_t$ to the next iteration. This eliminates the information bottleneck of discrete tokenization during reasoning.
Approaches
Recurrent Depth
The simplest form: apply the same transformer block (or a subset of blocks) multiple times. A standard transformer with $L$ layers computes:

$$h^{(L)} = f_L(f_{L-1}(\cdots f_1(h^{(0)}) \cdots))$$

A recurrent-depth model with $L$ layers and $r$ recurrence steps computes:

$$h_{j+1} = f_L(f_{L-1}(\cdots f_1(h_j) \cdots)), \quad j = 0, 1, \ldots, r-1$$

and feeds $h_{j+1}$ back as input for recurrence step $j+1$. The effective depth is $L \cdot r$ while the parameter count remains that of $L$ layers.
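The parameter-sharing arithmetic can be made concrete with a toy model. A "layer" here is a single weight matrix rather than a full transformer block, but the comparison of parameter counts works the same way; sizes are arbitrary.

```python
import numpy as np

# Sketch of recurrent depth vs. standard depth, counting parameters.
# A "layer" here is one weight matrix; a real transformer block is richer,
# but the parameter-sharing arithmetic is identical.

rng = np.random.default_rng(0)
d, L, r = 16, 4, 5                     # hidden dim, layers, recurrence steps

# Standard model: L * r distinct layers for effective depth L * r.
standard_layers = [rng.standard_normal((d, d)) for _ in range(L * r)]

# Recurrent model: the same L layers reused r times, same effective depth.
recurrent_layers = [rng.standard_normal((d, d)) for _ in range(L)]

def forward_recurrent(h, layers, steps):
    for _ in range(steps):              # feed the output back as input
        for W in layers:
            h = np.tanh(W @ h)
    return h

h = forward_recurrent(rng.standard_normal(d), recurrent_layers, r)

params_standard = sum(W.size for W in standard_layers)
params_recurrent = sum(W.size for W in recurrent_layers)
print(params_standard // params_recurrent)  # 5: the recurrent model is r x smaller
```

Both models apply $L \cdot r = 20$ weight matrices per forward pass, but the recurrent one stores only $L = 4$ of them.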
Coconut (Chain of Continuous Thought)
Coconut (Hao et al., 2024) replaces discrete chain-of-thought tokens with continuous hidden states. During training:
- Start with a standard CoT training example
- Replace each reasoning token with a special "thought" token
- Instead of embedding the thought token discretely, pass the continuous hidden state from the previous position directly
- The model learns to reason in this continuous space
At inference, the model generates a fixed number of continuous thought steps (no discrete tokens emitted), then generates the answer.
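The inference loop can be sketched as follows. Everything here is a stand-in: `model`, `decode`, and the shapes are illustrative assumptions, not the Coconut implementation.

```python
import numpy as np

# Sketch of Coconut-style inference: instead of sampling a token and
# re-embedding it, the previous hidden state is fed back directly for a
# fixed number of "thought" steps, then a token is decoded at the end.
# model(), decode(), and all sizes are illustrative assumptions.

rng = np.random.default_rng(0)
d, vocab = 8, 100
W_h = rng.standard_normal((d, d)) / np.sqrt(d)
W_out = rng.standard_normal((vocab, d)) / np.sqrt(d)

def model(h):
    """One forward pass at the latest position (toy stand-in)."""
    return np.tanh(W_h @ h)

def decode(h):
    """Project the final hidden state to a token id."""
    return int(np.argmax(W_out @ h))

h = rng.standard_normal(d)             # hidden state after the prompt
for _ in range(6):                     # 6 continuous thought steps:
    h = model(h)                       # the state is fed back directly,
                                       # no discrete tokens are emitted
answer_token = decode(h)               # only the answer is decoded
print(0 <= answer_token < vocab)       # True
```

Contrast with standard decoding, where each loop iteration would end in `decode` and re-embed the sampled token, forcing the state through the vocabulary bottleneck.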
Pause Tokens
A simpler variant: insert learnable "pause" tokens into the input. These tokens carry no semantic content but give the model additional forward-pass computation before it must produce an output. The model learns to use these extra positions for internal computation.
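A minimal sketch of the input-side mechanics, under assumed shapes (the pause embedding would be a trained parameter in practice, not a random vector):

```python
import numpy as np

# Sketch of pause tokens: learnable embeddings with no semantic content
# are appended to the input, giving the model extra forward-pass positions
# to compute in before it must answer. Names and sizes are illustrative.

rng = np.random.default_rng(0)
d, seq_len, n_pause = 8, 5, 3

input_embeds = rng.standard_normal((seq_len, d))   # embedded input tokens
pause_embed = rng.standard_normal(d) * 0.02        # one learnable vector,
pauses = np.tile(pause_embed, (n_pause, 1))        # repeated n_pause times

# The model sees the input followed by the pause positions; its output at
# the final position benefits from the extra computation.
extended = np.concatenate([input_embeds, pauses], axis=0)
print(extended.shape)                              # (8, 8)
```

Note the contrast with recurrent depth: pause tokens add width (extra positions) rather than reusing layers, so compute grows with sequence length, not iteration count.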
Main Theorems
Recurrent Depth Increases Expressiveness
Statement
A transformer with $L$ layers and $r$ recurrence steps can represent functions that require depth $L \cdot r$ in a standard (non-recurrent) transformer. Specifically, for Boolean circuit families of depth $d$, a recurrent transformer with a fixed number of layers and $O(d)$ recurrence steps can simulate any circuit in the family, while a non-recurrent transformer requires $\Omega(d)$ layers.
Intuition
Each recurrence step adds $L$ layers of effective depth without adding parameters. A recurrent transformer with 12 layers and 10 recurrence steps has the computational capacity of a 120-layer transformer but the parameter count of a 12-layer one. The trade-off is compute time: $r$ recurrence steps take $r$ times as long as a single pass.
Why It Matters
This formalizes the intuition that latent reasoning trades parameters for compute. A smaller model with more recurrence steps can match the expressiveness of a much larger model with a single pass. This is the theoretical basis for scaling inference compute without scaling model size.
Failure Mode
Expressiveness does not guarantee learnability. A recurrent transformer can represent depth- circuits, but gradient-based training may not find these solutions. Recurrent computation introduces vanishing/exploding gradient issues analogous to those in RNNs. Residual connections and careful initialization help but do not eliminate the problem. In practice, the number of useful recurrence steps is limited.
The Information Bottleneck Argument
Standard CoT forces all inter-step communication through discrete tokens. Each token is an element of a finite vocabulary of size $|V|$. The information capacity per step is at most $\log_2 |V|$ bits (about 15 bits for a 32K vocabulary).

Continuous hidden states are vectors in $\mathbb{R}^d$ with $d$ typically 4096 or more. Even with finite-precision arithmetic, the information capacity per step is vastly larger. Latent reasoning removes the discrete bottleneck, allowing richer information flow between reasoning steps.
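The capacity gap is easy to check numerically. The continuous figure is a raw bit-width upper bound, not usable information, which is part of why the comparison below is loose:

```python
import math

# Checking the capacity numbers: bits per discrete token vs. the raw bit
# width of a continuous hidden state (an upper bound, not usable capacity).

vocab_size = 32_000
d, bits_per_entry = 4096, 16           # fp16 hidden state

bits_discrete = math.log2(vocab_size)  # ~14.97 bits per token
bits_continuous = d * bits_per_entry   # 65,536 raw bits per state

print(round(bits_discrete, 2))         # 14.97
print(bits_continuous)                 # 65536
```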
This does not mean latent reasoning is strictly better. Discrete tokens provide interpretability (you can read the chain of thought) and serve as a form of regularization (the model must compress its reasoning into human-readable form). Latent reasoning sacrifices both.
Challenges
Training difficulty. Latent reasoning requires learning what to compute in hidden space without the supervision signal that discrete CoT tokens provide. In standard CoT training, each intermediate token has a ground-truth target. In latent reasoning, only the final answer provides supervision. The model must discover useful intermediate representations on its own.
Interpretability loss. With discrete CoT, you can inspect the reasoning trace. With latent reasoning, the computation is opaque. This makes debugging, auditing, and alignment substantially harder.
Optimal depth selection. How many recurrence steps should the model use? Too few and the model cannot solve hard problems. Too many and compute is wasted on easy problems. Adaptive depth (dynamically choosing per input) is an open problem.
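One simple heuristic worth sketching is convergence-based halting: stop recurring when the state stops changing, capped by a step budget. This is an illustrative assumption, not a published method (learned halting in the style of Universal Transformers' adaptive computation is the more principled approach):

```python
import numpy as np

# A toy halting heuristic for adaptive depth (an illustrative assumption):
# stop recurring when the hidden state changes by less than a tolerance,
# capped at a maximum number of steps.

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)

def step(h):
    return np.tanh(W @ h)

def adaptive_reason(h, max_steps=50, tol=1e-4):
    """Iterate until the state stops changing or the step budget runs out."""
    for i in range(max_steps):
        h_next = step(h)
        if np.linalg.norm(h_next - h) < tol:
            return h_next, i + 1       # converged early: "easy" input
        h = h_next
    return h, max_steps                # budget exhausted: "hard" input

h_final, steps_used = adaptive_reason(rng.standard_normal(d))
print(1 <= steps_used <= 50)           # True
```

A fixed-point criterion like this only works if the recurrence actually contracts; learned halting avoids that assumption but adds its own training difficulties.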
Latent reasoning is not just deeper networks
Adding more layers to a standard transformer increases depth but also increases parameter count. Latent reasoning reuses the same parameters for multiple passes. The distinction matters: a 120-layer transformer has 10x the parameters of a 12-layer one, but a 12-layer recurrent transformer with 10 passes has the same parameters as the base 12-layer model.
Continuous thought is not soft prompting
Soft prompting learns continuous embeddings as inputs to the model. Continuous thought passes continuous hidden states between reasoning steps. Soft prompts are fixed after training. Continuous thought states are dynamically computed at inference time based on the input.
Summary
- Standard CoT couples reasoning depth to output length; latent reasoning decouples them
- Recurrent depth: apply the same transformer blocks multiple times, trading compute for effective depth without adding parameters
- Continuous thought (Coconut): replace discrete reasoning tokens with continuous hidden state propagation
- Information capacity per reasoning step is much higher in continuous space than in discrete token space
- Challenges: training is harder without intermediate supervision, interpretability is lost, and adaptive depth is unsolved
- This is early-stage research with promising but preliminary results
Exercises
Problem
A standard transformer has $L$ layers. A recurrent variant uses $\ell$ layers with $r$ recurrence steps. Compare their effective depth and parameter counts (assume each layer has the same number of parameters).
Problem
A vocabulary of size $|V|$ provides at most $\log_2 |V|$ bits of information per discrete reasoning token. A continuous hidden state has dimension $d$ with 16-bit floating-point entries. What is the theoretical maximum information per continuous reasoning step? Why is this comparison misleading in practice?
References
Canonical:
- Hao et al., "Training Large Language Models to Reason in a Continuous Latent Space" (Coconut, 2024)
Current:
- Goyal et al., "Think before you speak: Training Language Models With Pause Tokens" (2024)
- Dehghani et al., "Universal Transformers" (ICLR 2019)
Next Topics
The natural next steps from latent reasoning:
- Multi-token prediction: another approach to planning ahead, but in token space rather than hidden state space
- Memory systems for LLMs: long-term memory complements latent reasoning by providing persistent storage beyond the hidden state
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Test-Time Compute and Search (Layer 5)
- Scaling Laws (Layer 4)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)