

Continuous Thought Machines

Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.

Research · Tier 3 · Frontier · ~35 min

Why This Matters

Standard neural networks have fixed computational depth: a 32-layer transformer always runs 32 layers, whether the input is trivial or requires deep reasoning. This is wasteful for easy inputs and insufficient for hard ones.

Continuous Thought Machines (CTMs) replace discrete layers with a continuous-time dynamical system that evolves the network's internal state. The system runs until the state converges or a halting criterion is met. Easy inputs converge quickly (few integration steps). Hard inputs require longer "thinking time." The network adaptively allocates computation based on input difficulty.

This connects neural ODEs (continuous depth), DEQ models (convergence to fixed point), and test-time compute (scaling inference-time reasoning) into a unified framework: internal dynamics that process information through continuous-time evolution.

The Formulation

Proposition

Continuous Thought Machine

Statement

A Continuous Thought Machine processes input $x$ by evolving an internal state $s(t) \in \mathbb{R}^d$ under learned dynamics:

$$\frac{ds}{dt} = f_\theta(s(t), x, t), \quad s(0) = g_\theta(x)$$

where $g_\theta$ encodes the input into the initial state and $f_\theta$ defines the continuous-time "thinking" dynamics. The output is read from the state at a variable time $T$:

$$y = h_\theta(s(T))$$

The thinking time $T$ can be:

  1. Fixed: integrate from $t = 0$ to $t = T_{\max}$ (like a neural ODE)
  2. Adaptive: halt when $\|ds/dt\| < \epsilon$ (convergence criterion, like a DEQ)
  3. Learned: a halting network $p_\theta(s(t))$ outputs a probability of stopping at each $t$ (like Adaptive Computation Time)

The key property: the amount of internal computation is not fixed by architecture but determined by the input and the learned dynamics.
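The formulation above can be sketched in a few lines. This is a hypothetical toy, not an implementation from any paper: `f` stands in for the learned dynamics $f_\theta$ (here a small contractive map chosen only for illustration), `np.tanh(x)` stands in for the encoder $g_\theta$, and halting uses the convergence criterion $\|ds/dt\| < \epsilon$.

```python
import numpy as np

# Toy CTM forward pass: Euler-integrate ds/dt = f(s, x) until the state
# stops changing. All weights are random placeholders, not trained values.
rng = np.random.default_rng(0)
d = 8
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)  # small weights keep f contractive
b = 0.1 * rng.standard_normal(d)

def f(s, x):
    """Stand-in for f_theta: pushes s toward an x-dependent fixed point."""
    return np.tanh(W @ s + x + b) - s

def ctm_forward(x, dt=0.1, eps=1e-4, max_steps=1000):
    """Integrate ds/dt = f(s, x) until ||ds/dt|| < eps (adaptive halting)."""
    s = np.tanh(x)  # stand-in for the encoder g_theta
    for step in range(max_steps):
        ds = f(s, x)
        if np.linalg.norm(ds) < eps:  # DEQ-style convergence criterion
            return s, step
        s = s + dt * ds
    return s, max_steps

# Each input halts when its own state converges; the architecture itself
# imposes no fixed depth.
s_final, n_steps = ctm_forward(np.zeros(d))
print(n_steps)
```

Swapping the halting rule for a fixed step count recovers the neural-ODE variant; replacing it with a learned probability recovers ACT-style halting.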

Intuition

Think of the network as a dynamical system that "ponders" the input. The initial state encodes the raw input. The dynamics $f_\theta$ refine this state over time: correcting errors, resolving ambiguities, and building up the answer. When the state stabilizes, the answer is ready. Hard problems require more refinement steps (longer integration). Easy problems converge quickly.

This is a formalization of "thinking harder" about a problem. The thinking is not discrete search (as in test-time compute with sampling) but continuous refinement of an internal representation.

Why It Matters

CTMs unify several ideas that are usually treated separately:

  • Neural ODEs: continuous depth with ODE dynamics
  • DEQ models: fixed-point convergence as the "answer"
  • Adaptive Computation Time (ACT): learned halting
  • Universal transformers: weight-tied transformers with halting
  • PonderNet: stochastic halting with geometric priors

The CTM framework shows these are all instances of "evolve a state under learned dynamics until ready," with different choices for the dynamics, halting, and training objectives.

Failure Mode

Adaptive halting introduces a chicken-and-egg problem: the network must decide when to stop, but it does not know the right answer until it stops. Training the halting mechanism requires balancing accuracy (think longer = better) against efficiency (think shorter = cheaper). Regularizing the thinking time (penalizing long pondering) can cause the network to halt too early on hard inputs. The ACT ponder penalty $\lambda \sum_t p_t$ trades accuracy for speed, and the right $\lambda$ is problem-dependent.
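The ACT mechanism that this penalty regularizes can be sketched as follows. This is an illustrative simplification, not Graves' exact implementation: halting probabilities $p_t$ (here derived from made-up logits rather than the hidden state) accumulate until they cross a threshold, and the ponder cost $\rho = N + \text{remainder}$ is the quantity the penalty $\lambda \rho$ pushes down.

```python
import numpy as np

# Sketch of ACT-style halting: stop once the cumulative halting
# probability crosses (1 - epsilon), here written as a fixed threshold.
def act_halt(halting_logits, threshold=0.99):
    """Return (N, rho): ponder steps taken and the ACT ponder cost."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(halting_logits, dtype=float)))  # sigmoid
    cum = 0.0
    for n, p_t in enumerate(p, start=1):
        prev = cum
        cum += p_t
        if cum >= threshold or n == len(p):
            remainder = max(0.0, 1.0 - prev)  # leftover probability mass
            return n, n + remainder

# A confident halting unit (high logits) stops quickly; a hesitant one
# (low logits) accumulates probability slowly and ponders longer.
n_easy, cost_easy = act_halt([3.0, 3.0, 3.0])
n_hard, cost_hard = act_halt([-3.0] * 10)
print(n_easy, n_hard)  # the low-logit sequence takes more ponder steps
```

Increasing $\lambda$ on the returned cost pushes the logits up and shortens pondering everywhere, which is exactly how the penalty can cause premature halting on hard inputs.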

Connection to Existing Architectures

| Architecture | Internal dynamics | Halting | Training |
| --- | --- | --- | --- |
| Standard transformer | Discrete, $L$ steps | Fixed ($L$ layers) | Backprop through $L$ |
| Universal Transformer | Discrete, weight-tied | ACT (learned halting) | Backprop + ACT loss |
| Neural ODE | Continuous, $ds/dt = f(s,t)$ | Fixed $T$ | Adjoint method |
| DEQ | Discrete iteration to fixed point | Convergence ($\|s_{t+1} - s_t\| < \epsilon$) | Implicit differentiation |
| PonderNet | Discrete, weight-tied | Stochastic (geometric prior) | REINFORCE + reconstruction |
| Continuous Thought Machine | Continuous, $ds/dt = f(s,x,t)$ | Adaptive (learned or convergence) | Adjoint + halting loss |

What This Changes About How We Think

The standard view: a neural network has a fixed computational budget (number of layers). You train it and deploy it. Every input gets the same budget.

The CTM view: a neural network has a learned computational process. Easy inputs terminate early. Hard inputs get more processing. The network allocates its own compute based on the difficulty of the input, without external scheduling.

This aligns with how human cognition works: recognizing a familiar face takes milliseconds; solving a logic puzzle takes minutes. The computational cost scales with the task, not the architecture.

The practical challenge is making this work at scale with stable training, efficient hardware utilization (variable-length computation is hard to batch on GPUs), and reliable halting that does not degenerate to always-minimum or always-maximum computation.

Common Confusions

Watch Out

Adaptive computation is not the same as early exit

Early exit (skipping later layers if confidence is high) is a heuristic applied to a fixed-depth network. The network was trained with all layers; early exit just skips some. CTMs are different: the dynamics are trained to converge at the right time. The training objective rewards both correct answers and efficient convergence. The network learns when to stop, not just when it can skip.

Watch Out

More thinking time does not always help

If the dynamics $f_\theta$ are poorly trained, running them longer can cause the state to diverge, oscillate, or settle on a worse solution. More time only helps when the dynamics are contractive and the loss landscape around the equilibrium is well-behaved. In chaotic dynamics, longer integration amplifies errors. This is why the stability analysis from DEQ models (contraction mapping, spectral radius of the Jacobian) matters.
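The stability point above can be made concrete with a toy linearization (illustrative, not from any specific paper): near an equilibrium, one update step behaves like $s \leftarrow J s$ for the Jacobian $J$, so the spectral radius of $J$ determines whether more thinking time shrinks or amplifies the error.

```python
import numpy as np

def spectral_radius(J):
    """Largest absolute eigenvalue of the update Jacobian."""
    return float(np.max(np.abs(np.linalg.eigvals(J))))

def run(J, s0, steps):
    """Apply the linearized update `steps` times; return the error norm."""
    s = s0.copy()
    for _ in range(steps):
        s = J @ s
    return float(np.linalg.norm(s))

rng = np.random.default_rng(1)
d = 6
s0 = rng.standard_normal(d)  # initial deviation from the equilibrium

J_contractive = 0.8 * np.eye(d)  # spectral radius 0.8: more steps help
J_expansive = 1.2 * np.eye(d)    # spectral radius 1.2: more steps hurt

print(run(J_contractive, s0, 50))  # error shrinks toward 0
print(run(J_expansive, s0, 50))    # error blows up
```

The same check (spectral radius below 1) is the contraction condition DEQ analyses use to guarantee that fixed-point iteration converges.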

Exercises

ExerciseCore

Problem

A Universal Transformer with ACT processes two inputs: "2 + 3 = ?" (easy) and a complex logical reasoning problem (hard). The ACT mechanism allocates 3 ponder steps to the easy input and 15 ponder steps to the hard input. Explain why this is more efficient than a standard 15-layer transformer, and what the training cost of this adaptivity is.

ExerciseAdvanced

Problem

In a CTM with dynamics $\frac{ds}{dt} = f_\theta(s, x, t)$, the halting criterion is $\|f_\theta(s(t), x, t)\| < \epsilon$. Explain why this is equivalent to converging to an approximate fixed point, and connect this to the DEQ framework.

References

Canonical:

  • Graves, "Adaptive Computation Time for Recurrent Neural Networks" (2016). The ACT mechanism.
  • Dehghani et al., "Universal Transformers" (ICLR 2019). Weight-tied transformers with ACT.

Current:

  • Banino et al., "PonderNet: Learning to Ponder" (2021). Stochastic halting with geometric prior.
  • Heek, "Continuous Thought Machines" (2025). Continuous-time formulation.
  • Chen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018). The ODE framework.

Next Topics

  • Test-time training: another form of adaptive inference, where the model updates its weights rather than its hidden state
  • Open problems in ML theory: adaptive computation connects to the broader question of why some inputs need more processing

Last reviewed: April 2026
