Beyond LLMs
Continuous Thought Machines
Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.
Why This Matters
Standard neural networks have fixed computational depth: a 32-layer transformer always runs 32 layers, whether the input is trivial or requires deep reasoning. This is wasteful for easy inputs and insufficient for hard ones.
Continuous Thought Machines (CTMs) replace discrete layers with a continuous-time dynamical system that evolves the network's internal state. The system runs until the state converges or a halting criterion is met. Easy inputs converge quickly (few integration steps). Hard inputs require longer "thinking time." The network adaptively allocates computation based on input difficulty.
This connects neural ODEs (continuous depth), DEQ models (convergence to fixed point), and test-time compute (scaling inference-time reasoning) into a unified framework: internal dynamics that process information through continuous-time evolution.
The Formulation
Continuous Thought Machine
Statement
A Continuous Thought Machine processes an input $x$ by evolving an internal state $h(t)$ under learned dynamics:

$$\frac{dh}{dt} = f_\theta(h(t), x), \qquad h(0) = g_\phi(x),$$

where $g_\phi$ encodes the input into the initial state and $f_\theta$ defines the continuous-time "thinking" dynamics. The output is read from the state at a variable time $T$:

$$y = r_\psi(h(T)).$$

The thinking time $T$ can be:
- Fixed: integrate from $t = 0$ to $t = T$ (like a neural ODE)
- Adaptive: halt when $\|f_\theta(h(t), x)\| < \epsilon$ (convergence criterion, like a DEQ)
- Learned: a halting network outputs a probability of stopping at each $t$ (like Adaptive Computation Time)
The key property: the amount of internal computation is not fixed by architecture but determined by the input and the learned dynamics.
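As a concrete sketch (illustrative, untrained weights, not the published CTM architecture), the fixed-horizon and convergence-halting variants can be combined in a few lines of NumPy: Euler-integrate the dynamics and stop when the derivative norm falls below a threshold. The names `f_theta` and `g_phi` mirror the symbols above, but the weights are random placeholders.

```python
import numpy as np

# Minimal CTM-style loop (illustrative, untrained weights): evolve the state
# h under dynamics f_theta until the derivative norm drops below eps
# (convergence halting) or a step budget is exhausted (fixed-horizon fallback).

rng = np.random.default_rng(0)
D = 8                                    # state dimension
W = rng.normal(size=(D, D))
W *= 0.5 / np.linalg.norm(W, 2)          # small spectral norm keeps dynamics contractive
U = rng.normal(size=(D, D)) * 0.2        # stand-in for the input encoder

def f_theta(h, x):
    """Thinking dynamics dh/dt = f_theta(h, x): drift toward a fixed point."""
    return np.tanh(W @ h + U @ x) - h

def think(x, dt=0.1, eps=1e-4, max_steps=1000):
    h = np.tanh(U @ x)                   # g_phi: encode input into h(0)
    for step in range(1, max_steps + 1):
        dh = f_theta(h, x)
        if np.linalg.norm(dh) < eps:     # adaptive halting: state converged
            return h, step
        h = h + dt * dh                  # explicit Euler integration step
    return h, max_steps

x = rng.normal(size=D)
h_final, steps = think(x)
print(steps)  # input-dependent number of "thinking" steps
```

The readout $r_\psi$ is omitted here; a trained model would map `h_final` to the output, and the number of Euler steps taken is the input's "thinking time."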
Intuition
Think of the network as a dynamical system that "ponders" the input. The initial state encodes the raw input. The dynamics refine this state over time: correcting errors, resolving ambiguities, and building up the answer. When the state stabilizes, the answer is ready. Hard problems require more refinement steps (longer integration). Easy problems converge quickly.
This is a formalization of "thinking harder" about a problem. The thinking is not discrete search (as in test-time compute with sampling) but continuous refinement of an internal representation.
Why It Matters
CTMs unify several ideas that are usually treated separately:
- Neural ODEs: continuous depth with ODE dynamics
- DEQ models: fixed-point convergence as the "answer"
- Adaptive Computation Time (ACT): learned halting
- Universal transformers: weight-tied transformers with halting
- PonderNet: stochastic halting with geometric priors
The CTM framework shows these are all instances of "evolve a state under learned dynamics until ready," with different choices for the dynamics, halting, and training objectives.
Failure Mode
Adaptive halting introduces a chicken-and-egg problem: the network must decide when to stop, but it does not know whether its answer is right until it stops. Training the halting mechanism requires balancing accuracy (thinking longer helps) against efficiency (thinking shorter is cheaper). Regularizing the thinking time (penalizing long pondering) can cause the network to halt too early on hard inputs. The ACT ponder penalty trades accuracy for speed, and the right penalty weight is problem-dependent.
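The tension can be made concrete with a toy model (assumed numbers, not from any paper): suppose the task loss falls off as $1/n$ with thinking steps $n$, and the ponder penalty adds $\tau n$. The total $1/n + \tau n$ is minimized near $n^* = 1/\sqrt{\tau}$, so a heavier penalty directly forces earlier halting.

```python
import numpy as np

# Toy ponder trade-off (hypothetical loss curve): total loss = 1/n + tau * n,
# where n is the number of thinking steps and tau is an ACT-style ponder
# penalty. The minimizer is roughly 1/sqrt(tau).

def best_steps(tau, max_steps=200):
    n = np.arange(1, max_steps + 1)
    return int(n[np.argmin(1.0 / n + tau * n)])

print(best_steps(0.0025))  # light penalty: halts around step 20
print(best_steps(0.04))    # heavy penalty: halts around step 5
```

If hard inputs need more steps than $1/\sqrt{\tau}$ to reach low task loss, a single global $\tau$ will cut their thinking short, which is exactly the failure mode described above.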
Connection to Existing Architectures
| Architecture | Internal dynamics | Halting | Training |
|---|---|---|---|
| Standard transformer | Discrete, $L$ steps | Fixed ($L$ layers) | Backprop through $L$ layers |
| Universal Transformer | Discrete, weight-tied | ACT (learned halting) | Backprop + ACT loss |
| Neural ODE | Continuous, $dh/dt = f_\theta(h, x)$ | Fixed $T$ | Adjoint method |
| DEQ | Discrete iteration to fixed point | Convergence ($\|h_{k+1} - h_k\| < \epsilon$) | Implicit differentiation |
| PonderNet | Discrete, weight-tied | Stochastic (geometric prior) | Expected loss over halting steps + KL to prior |
| Continuous Thought Machine | Continuous, $dh/dt = f_\theta(h, x)$ | Adaptive (learned or convergence) | Adjoint + halting loss |
What This Changes About How We Think
The standard view: a neural network has a fixed computational budget (number of layers). You train it and deploy it. Every input gets the same budget.
The CTM view: a neural network has a learned computational process. Easy inputs terminate early. Hard inputs get more processing. The network allocates its own compute based on the difficulty of the input, without external scheduling.
This aligns with how human cognition works: recognizing a familiar face takes milliseconds; solving a logic puzzle takes minutes. The computational cost scales with the task, not the architecture.
The practical challenge is making this work at scale with stable training, efficient hardware utilization (variable-length computation is hard to batch on GPUs), and reliable halting that does not degenerate to always-minimum or always-maximum computation.
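One common workaround for the batching problem (a sketch of the general pattern, not any specific library's API) is to run every example in the batch for the same number of loop iterations but freeze converged rows with a boolean mask: tensor shapes stay fixed, which is what GPUs want, while per-example compute still varies.

```python
import numpy as np

# Masked batched halting: all examples share the loop, but rows whose update
# norm has dropped below eps stop changing, and we record per-example step
# counts. Weights are random placeholders.

rng = np.random.default_rng(1)
B, D = 4, 8                              # batch size, state dimension
W = rng.normal(size=(D, D))
W *= 0.5 / np.linalg.norm(W, 2)          # contractive dynamics
H = rng.normal(size=(B, D))              # batch of internal states
steps_used = np.zeros(B, dtype=int)
active = np.ones(B, dtype=bool)

for _ in range(500):
    dH = np.tanh(H @ W.T) - H            # dynamics applied to the whole batch
    H = H + 0.1 * dH * active[:, None]   # frozen (converged) rows get no update
    steps_used += active                 # count steps only while active
    active &= np.linalg.norm(dH, axis=1) >= 1e-4
    if not active.any():
        break

print(steps_used)  # per-example thinking steps
```

The cost of this trick is that frozen rows still occupy compute in each iteration; the wall-clock saving only appears once the whole batch has halted, which is part of why variable-length computation remains awkward on batched hardware.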
Common Confusions
Adaptive computation is not the same as early exit
Early exit (skipping later layers if confidence is high) is a heuristic applied to a fixed-depth network. The network was trained with all layers; early exit just skips some. CTMs are different: the dynamics are trained to converge at the right time. The training objective rewards both correct answers and efficient convergence. The network learns when to stop, not just when it can skip.
More thinking time does not always help
If the dynamics are poorly trained, running them longer can cause the state to diverge, oscillate, or settle on a worse solution. More time only helps when the dynamics are contractive and the loss landscape around the equilibrium is well-behaved. In chaotic dynamics, longer integration amplifies errors. This is why the stability analysis from DEQ models (contraction mapping, spectral radius of the Jacobian) matters.
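That stability condition can be checked numerically (a sketch; `step` is a stand-in update map, and the Jacobian-vector product is a finite-difference approximation): estimate the dominant eigenvalue magnitude of the update's Jacobian at the reached state with power iteration, and require it to be below 1.

```python
import numpy as np

# Estimate the spectral radius of the Jacobian of one thinking update at the
# state the iteration reached. Below 1: locally contractive, so longer
# thinking converges. Above 1: more steps can diverge or oscillate.

rng = np.random.default_rng(2)
D = 6
W = rng.normal(size=(D, D))
W *= 0.8 / np.linalg.norm(W, 2)          # spectral norm 0.8: contractive map

def step(h):
    return np.tanh(W @ h)                # one discrete thinking update

h = rng.normal(size=D)
for _ in range(200):                     # iterate to (near) the fixed point
    h = step(h)

def jvp(h, v, eps=1e-6):
    """Finite-difference Jacobian-vector product of `step` at h."""
    return (step(h + eps * v) - step(h)) / eps

v = rng.normal(size=D)
for _ in range(100):                     # power iteration on the Jacobian
    v = jvp(h, v)
    v /= np.linalg.norm(v)
rho = np.linalg.norm(jvp(h, v))          # dominant eigenvalue magnitude (approx.)
print(rho < 1.0)                         # contractive here by construction
```

In a real model the Jacobian-vector product would come from automatic differentiation rather than finite differences, but the diagnostic is the same one the DEQ literature uses.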
Exercises
Problem
A Universal Transformer with ACT processes two inputs: "2 + 3 = ?" (easy) and a complex logical reasoning problem (hard). The ACT mechanism allocates 3 ponder steps to the easy input and 15 ponder steps to the hard input. Explain why this is more efficient than a standard 15-layer transformer, and what the training cost of this adaptivity is.
Problem
In a CTM with dynamics $dh/dt = f_\theta(h, x)$, the halting criterion is $\|f_\theta(h(t), x)\| < \epsilon$. Explain why this is equivalent to converging to an approximate fixed point of the dynamics, and connect this to the DEQ framework.
References
Canonical:
- Graves, "Adaptive Computation Time for Recurrent Neural Networks" (2016). The ACT mechanism.
- Dehghani et al., "Universal Transformers" (ICLR 2019). Weight-tied transformers with ACT.
Current:
- Banino et al., "PonderNet: Learning to Ponder" (2021). Stochastic halting with geometric prior.
- Darlow et al., "Continuous Thought Machines" (2025). Continuous-time formulation.
- Chen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018). The ODE framework.
Next Topics
- Test-time training: another form of adaptive inference, where the model updates its weights rather than its hidden state
- Open problems in ML theory: adaptive computation connects to the broader question of why some inputs need more processing
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Neural ODEs and Continuous-Depth Networks (Layer 4)
- Skip Connections and ResNets (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Gradient Flow and Vanishing Gradients (Layer 2)
- Automatic Differentiation (Layer 1)
- Equilibrium and Implicit-Layer Models (Layer 4)
- Implicit Differentiation (Layer 2)