

Continuous Thought Machines

Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.

Research · Tier 3 · Frontier · ~35 min

Why This Matters

Standard neural networks have fixed computational depth: a 32-layer transformer always runs 32 layers, whether the input is trivial or requires deep reasoning. This is wasteful for easy inputs and insufficient for hard ones.

Continuous Thought Machines (CTMs) replace discrete layers with a continuous-time dynamical system that evolves the network's internal state. The system runs until the state converges or a halting criterion is met. Easy inputs converge quickly (few integration steps). Hard inputs require longer "thinking time." The network adaptively allocates computation based on input difficulty.

This connects neural ODEs (continuous depth), DEQ models (convergence to fixed point), and test-time compute (scaling inference-time reasoning) into a unified framework: internal dynamics that process information through continuous-time evolution.

The Formulation

Proposition

Continuous Thought Machine

Statement

A Continuous Thought Machine processes input $x$ by evolving an internal state $s(t) \in \mathbb{R}^d$ under learned dynamics:

$$\frac{ds}{dt} = f_\theta(s(t), x, t), \quad s(0) = g_\theta(x)$$

where $g_\theta$ encodes the input into the initial state and $f_\theta$ defines the continuous-time "thinking" dynamics. The output is read from the state at a variable time $T$:

$$y = h_\theta(s(T))$$

The thinking time $T$ can be:

  1. Fixed: integrate from $t = 0$ to $t = T_{\max}$ (like a neural ODE)
  2. Adaptive: halt when $\|ds/dt\| < \epsilon$ (convergence criterion, like a DEQ)
  3. Learned: a halting network $p_\theta(s(t))$ outputs a probability of stopping at each $t$ (like Adaptive Computation Time)

The key property: the amount of internal computation is not fixed by architecture but determined by the input and the learned dynamics.
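The formulation above can be sketched in a few lines. This is a hypothetical toy, not an implementation from any paper: `f` stands in for the learned dynamics $f_\theta$ (here a small contractive map chosen only for illustration), `np.tanh(x)` stands in for the encoder $g_\theta$, and halting uses the convergence criterion $\|ds/dt\| < \epsilon$.

```python
import numpy as np

# Toy CTM forward pass: Euler-integrate ds/dt = f(s, x) until the state
# stops changing. All weights are random placeholders, not trained values.
rng = np.random.default_rng(0)
d = 8
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)  # small weights keep f contractive
b = 0.1 * rng.standard_normal(d)

def f(s, x):
    """Stand-in for f_theta: pushes s toward an x-dependent fixed point."""
    return np.tanh(W @ s + x + b) - s

def ctm_forward(x, dt=0.1, eps=1e-4, max_steps=1000):
    """Integrate ds/dt = f(s, x) until ||ds/dt|| < eps (adaptive halting)."""
    s = np.tanh(x)  # stand-in for the encoder g_theta
    for step in range(max_steps):
        ds = f(s, x)
        if np.linalg.norm(ds) < eps:  # DEQ-style convergence criterion
            return s, step
        s = s + dt * ds
    return s, max_steps

# Each input halts when its own state converges; the architecture itself
# imposes no fixed depth.
s_final, n_steps = ctm_forward(np.zeros(d))
print(n_steps)
```

Swapping the halting rule for a fixed step count recovers the neural-ODE variant; replacing it with a learned probability recovers ACT-style halting.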

Intuition

Think of the network as a dynamical system that "ponders" the input. The initial state encodes the raw input. The dynamics $f_\theta$ refine this state over time: correcting errors, resolving ambiguities, and building up the answer. When the state stabilizes, the answer is ready. Hard problems require more refinement steps (longer integration). Easy problems converge quickly.

This is a formalization of "thinking harder" about a problem. The thinking is not discrete search (as in test-time compute with sampling) but continuous refinement of an internal representation.

Why It Matters

CTMs unify several ideas that are usually treated separately:

  • Neural ODEs: continuous depth with ODE dynamics
  • DEQ models: fixed-point convergence as the "answer"
  • Adaptive Computation Time (ACT): learned halting
  • Universal transformers: weight-tied transformers with halting
  • PonderNet: stochastic halting with geometric priors

The CTM framework shows these are all instances of "evolve a state under learned dynamics until ready," with different choices for the dynamics, halting, and training objectives.

Failure Mode

Adaptive halting introduces a chicken-and-egg problem: the network must decide when to stop, but it does not know the right answer until it stops. Training the halting mechanism requires balancing accuracy (think longer = better) against efficiency (think shorter = cheaper). Regularizing the thinking time (penalizing long pondering) can cause the network to halt too early on hard inputs. The ACT ponder penalty $\lambda \sum_t p_t$ trades accuracy for speed, and the right $\lambda$ is problem-dependent.
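The ACT mechanism that this penalty regularizes can be sketched as follows. This is an illustrative simplification, not Graves' exact implementation: halting probabilities $p_t$ (here derived from made-up logits rather than the hidden state) accumulate until they cross a threshold, and the ponder cost $\rho = N + \text{remainder}$ is the quantity the penalty $\lambda \rho$ pushes down.

```python
import numpy as np

# Sketch of ACT-style halting: stop once the cumulative halting
# probability crosses (1 - epsilon), here written as a fixed threshold.
def act_halt(halting_logits, threshold=0.99):
    """Return (N, rho): ponder steps taken and the ACT ponder cost."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(halting_logits, dtype=float)))  # sigmoid
    cum = 0.0
    for n, p_t in enumerate(p, start=1):
        prev = cum
        cum += p_t
        if cum >= threshold or n == len(p):
            remainder = max(0.0, 1.0 - prev)  # leftover probability mass
            return n, n + remainder

# A confident halting unit (high logits) stops quickly; a hesitant one
# (low logits) accumulates probability slowly and ponders longer.
n_easy, cost_easy = act_halt([3.0, 3.0, 3.0])
n_hard, cost_hard = act_halt([-3.0] * 10)
print(n_easy, n_hard)  # the low-logit sequence takes more ponder steps
```

Increasing $\lambda$ on the returned cost pushes the logits up and shortens pondering everywhere, which is exactly how the penalty can cause premature halting on hard inputs.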

Connection to Existing Architectures

| Architecture | Internal dynamics | Halting | Training |
| --- | --- | --- | --- |
| Standard transformer | Discrete, $L$ steps | Fixed ($L$ layers) | Backprop through $L$ |
| Universal Transformer | Discrete, weight-tied | ACT (learned halting) | Backprop + ACT loss |
| Neural ODE | Continuous, $ds/dt = f(s,t)$ | Fixed $T$ | Adjoint method |
| DEQ | Discrete iteration to fixed point | Convergence ($\|s_{t+1} - s_t\| < \epsilon$) | Implicit differentiation |
| PonderNet | Discrete, weight-tied | Stochastic (geometric prior) | REINFORCE + reconstruction |
| Continuous Thought Machine | Continuous, $ds/dt = f(s,x,t)$ | Adaptive (learned or convergence) | Adjoint + halting loss |

What This Changes About How We Think

The standard view: a neural network has a fixed computational budget (number of layers). You train it and deploy it. Every input gets the same budget.

The CTM view: a neural network has a learned computational process. Easy inputs terminate early. Hard inputs get more processing. The network allocates its own compute based on the difficulty of the input, without external scheduling.

This aligns with how human cognition works: recognizing a familiar face takes milliseconds; solving a logic puzzle takes minutes. The computational cost scales with the task, not the architecture.

The practical challenge is making this work at scale with stable training, efficient hardware utilization (variable-length computation is hard to batch on GPUs), and reliable halting that does not degenerate to always-minimum or always-maximum computation.

Common Confusions

Watch Out

Adaptive computation is not the same as early exit

Early exit (skipping later layers if confidence is high) is a heuristic applied to a fixed-depth network. The network was trained with all layers; early exit just skips some. CTMs are different: the dynamics are trained to converge at the right time. The training objective rewards both correct answers and efficient convergence. The network learns when to stop, not just when it can skip.

Watch Out

More thinking time does not always help

If the dynamics $f_\theta$ are poorly trained, running them longer can cause the state to diverge, oscillate, or settle on a worse solution. More time only helps when the dynamics are contractive and the loss landscape around the equilibrium is well-behaved. In chaotic dynamics, longer integration amplifies errors. This is why the stability analysis from DEQ models (contraction mapping, spectral radius of the Jacobian) matters.
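The stability point above can be made concrete with a toy linearization (illustrative, not from any specific paper): near an equilibrium, one update step behaves like $s \leftarrow J s$ for the Jacobian $J$, so the spectral radius of $J$ determines whether more thinking time shrinks or amplifies the error.

```python
import numpy as np

def spectral_radius(J):
    """Largest absolute eigenvalue of the update Jacobian."""
    return float(np.max(np.abs(np.linalg.eigvals(J))))

def run(J, s0, steps):
    """Apply the linearized update `steps` times; return the error norm."""
    s = s0.copy()
    for _ in range(steps):
        s = J @ s
    return float(np.linalg.norm(s))

rng = np.random.default_rng(1)
d = 6
s0 = rng.standard_normal(d)  # initial deviation from the equilibrium

J_contractive = 0.8 * np.eye(d)  # spectral radius 0.8: more steps help
J_expansive = 1.2 * np.eye(d)    # spectral radius 1.2: more steps hurt

print(run(J_contractive, s0, 50))  # error shrinks toward 0
print(run(J_expansive, s0, 50))    # error blows up
```

The same check (spectral radius below 1) is the contraction condition DEQ analyses use to guarantee that fixed-point iteration converges.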

Exercises

ExerciseCore

Problem

A Universal Transformer with ACT processes two inputs: "2 + 3 = ?" (easy) and a complex logical reasoning problem (hard). The ACT mechanism allocates 3 ponder steps to the easy input and 15 ponder steps to the hard input. Explain why this is more efficient than a standard 15-layer transformer, and what the training cost of this adaptivity is.

ExerciseAdvanced

Problem

In a CTM with dynamics $\frac{ds}{dt} = f_\theta(s, x, t)$, the halting criterion is $\|f_\theta(s(t), x, t)\| < \epsilon$. Explain why this is equivalent to converging to an approximate fixed point, and connect this to the DEQ framework.

References

Canonical:

  • Graves, "Adaptive Computation Time for Recurrent Neural Networks" (2016). The ACT mechanism.
  • Dehghani et al., "Universal Transformers" (ICLR 2019). Weight-tied transformers with ACT.

Current:

  • Banino et al., "PonderNet: Learning to Ponder" (2021). Stochastic halting with geometric prior.
  • Heek, "Continuous Thought Machines" (2025). Continuous-time formulation.
  • Chen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018). The ODE framework.

Next Topics

  • Test-time training: another form of adaptive inference, where the model updates its weights rather than its hidden state
  • Open problems in ML theory: adaptive computation connects to the broader question of why some inputs need more processing

Last reviewed: April 2026
