Beyond LLMs
Test-Time Training and Adaptive Inference
Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.
Why This Matters
Standard neural networks freeze their weights after training. At inference time, the same weights process every input. Transformers use attention with a KV cache that grows with context. Mamba and other SSMs compress context into a fixed-size state vector, keeping the dynamics fixed.
Test-Time Training (TTT) is a different approach entirely. It is not built on Mamba or transformers. A TTT layer updates its own parameters on each input using a self-supervised gradient step. The "hidden state" is not a vector (as in SSMs) or a KV cache (as in attention), but a set of weights that are updated by gradient descent at every token.
TTT, Mamba, and attention are three competing architectures for sequence modeling. They solve the same problem (processing sequences with bounded compute per token) through different mechanisms. TTT uses gradient-based learning as its recurrence. Mamba uses structured state spaces. Attention uses content-based retrieval. TTT has shown stronger performance than Mamba on long-context tasks (beyond 16k tokens) because gradient-based weight updates can store more retrievable information than a fixed-size linear state.
The TTT Layer
TTT Layer: Learning as State Update
Statement
A TTT layer maintains a weight matrix $W_t$ that is updated at each timestep via a self-supervised gradient step:

$$W_t = W_{t-1} - \eta \, \nabla \ell(W_{t-1}; x_t)$$

where $\ell$ is a self-supervised loss (e.g., reconstruction loss, masked prediction loss) evaluated on the current input $x_t$. The output at step $t$ is computed using the updated weights:

$$z_t = f(x_t; W_t)$$

This replaces the linear recurrence of SSMs with a gradient-based update: the "state" is the weight matrix $W_t$, and the "transition function" is one step of gradient descent.
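The update rule above can be sketched in a few lines. The snippet below is an illustrative NumPy sketch (not the paper's implementation), assuming the simplest case of TTT-Linear, $f(x; W) = Wx$, with a plain reconstruction loss $\ell(W; x) = \tfrac{1}{2}\|Wx - x\|^2$; the helper name `ttt_step` is made up for this example:

```python
import numpy as np

def ttt_step(W, x, lr=0.1):
    """One TTT-Linear update (illustrative sketch):
    a gradient step on the reconstruction loss l(W; x) = 0.5 * ||Wx - x||^2."""
    residual = W @ x - x             # f(x; W) - x
    grad = np.outer(residual, x)     # dl/dW = (Wx - x) x^T
    W_new = W - lr * grad            # one step of gradient descent = state transition
    z = W_new @ x                    # output is computed with the *updated* weights
    return W_new, z

rng = np.random.default_rng(0)
d = 4
W = np.zeros((d, d))
x = rng.standard_normal(d)

# Feeding the same input repeatedly drives the self-supervised loss down:
# the "state" W is literally learning to reconstruct what it has seen.
losses = []
for _ in range(50):
    W, z = ttt_step(W, x)
    losses.append(0.5 * np.sum((W @ x - x) ** 2))
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

Note that for this quadratic loss the gradient has the closed form $(Wx - x)x^\top$, so no autodiff pass is needed; richer inner models would require an actual backward pass.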
Intuition
In a standard SSM, the hidden state is a compressed summary of the context, and the compression is linear (matrix multiply). This limits what can be stored. In a TTT layer, the "state" is an entire weight matrix, and the "compression" is gradient descent on a loss function. Gradient descent is a much richer update rule: it can store key-value associations, learn patterns, and adapt to the specific distribution of the current context.
The self-supervised loss $\ell$ acts as the "what to remember" signal. If the loss is reconstruction ($\ell(W; x_t) = \|f(x_t; W) - x_t\|^2$), the weights learn to reconstruct the input distribution seen so far. If it is masked prediction, the weights learn to predict missing tokens.
Why It Matters
TTT layers achieve the expressiveness benefits of attention (content-addressable memory, ability to store and retrieve arbitrary information) with the linear-time complexity of SSMs (the state update is $O(1)$ per step, independent of context length). The key insight: the weight matrix $W \in \mathbb{R}^{d \times d}$ can store $O(d^2)$ bits of information, compared to $O(dn)$ for Mamba (where typically $n \ll d$). This is why TTT layers can handle longer contexts without losing information.
Failure Mode
The gradient step at each position adds computation: you must compute $\nabla \ell(W_{t-1}; x_t)$ at every token, which involves a forward and backward pass through the TTT layer's inner model. This is more expensive per step than Mamba's linear recurrence. The learning rate $\eta$ is critical: too large and the weights oscillate, too small and the layer does not adapt. In practice, a linear self-supervised model (TTT-Linear, where $f(x; W) = Wx$) keeps the cost manageable while still outperforming fixed recurrences.
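The learning-rate sensitivity is easy to demonstrate. The sketch below (illustrative; `run_ttt_linear`, the dimensions, and the learning rates are made up for this example) drives a TTT-Linear state over a random token stream with a moderate versus an oversized step size:

```python
import numpy as np

def run_ttt_linear(lr, steps=20, d=8, seed=1):
    """Run a TTT-Linear state over a random token stream and return the final
    reconstruction loss. Illustrative sketch; hyperparameters are made up."""
    rng = np.random.default_rng(seed)
    W = np.zeros((d, d))
    for _ in range(steps):
        x = rng.standard_normal(d)
        # closed-form gradient of 0.5 * ||Wx - x||^2
        W = W - lr * np.outer(W @ x - x, x)
    x = rng.standard_normal(d)
    return 0.5 * np.sum((W @ x - x) ** 2)

print(run_ttt_linear(0.05))  # moderate step size: the state adapts, loss stays bounded
print(run_ttt_linear(5.0))   # oversized step size: updates overshoot and the state blows up
```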
TTT vs Attention vs SSM
| Property | Attention | SSM (Mamba) | TTT |
|---|---|---|---|
| State type | KV cache (grows with context) | Fixed-size vector | Weight matrix (fixed size) |
| State size | $O(Td)$ | $O(dn)$, $n \approx 16$ | $O(d^2)$ |
| Per-token cost | $O(Td)$ | $O(dn)$ | $O(d^2)$ |
| Retrieval ability | Exact (attends to all tokens) | Approximate (compressed) | Learned (gradient-based) |
| Information capacity | Unbounded (grows with $T$) | $O(dn)$ bits | $O(d^2)$ bits |
| Parallelizable | Yes (within sequence) | Yes (parallel scan) | Partially (mini-batch of tokens) |
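The asymptotics in the table become concrete with a quick back-of-envelope calculation. The dimensions below are hypothetical round numbers chosen for scale only, not taken from any published model configuration:

```python
# Hypothetical sizes, for scale only.
d = 1024      # model dimension
n = 16        # SSM state dimension per channel (Mamba-style)
T = 32_768    # context length in tokens

ssm_state = d * n           # fixed-size vector state: O(d*n) numbers
ttt_state = d * d           # TTT-Linear weight matrix: O(d^2) numbers
kv_cache = 2 * T * d        # attention keys + values: O(T*d) numbers, grows with T

print(f"SSM state: {ssm_state:>12,}")
print(f"TTT state: {ttt_state:>12,}")
print(f"KV cache:  {kv_cache:>12,}")
```

At these (made-up) sizes the TTT state holds 64x more numbers than the SSM state while remaining fixed-size, whereas the KV cache dwarfs both but keeps growing with the context.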
Connection to Online Learning and Meta-Learning
TTT is online learning applied inside a neural network layer. At each timestep, the layer receives a new data point $x_t$, updates its parameters to reduce the self-supervised loss, and makes a prediction. This is exactly the online convex optimization framework, except the loss is self-supervised (no labels needed) and the "prediction" feeds into the rest of the network.
The connection to meta-learning is direct: the outer network (trained end-to-end) learns the self-supervised loss, learning rate, and initialization that make the inner TTT updates maximally useful. The outer training optimizes the learning rule; the inner TTT applies it at test time. This is "learning to learn" in a literal sense.
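The inner/outer split can be caricatured in a few lines. Below, the "outer loop" is reduced to a crude stand-in (a grid search over the inner learning rate); real TTT training instead backpropagates through the inner loop to learn the learning rate, the loss, and the initialization jointly. All names and numbers here are made up for illustration:

```python
import numpy as np

def inner_ttt_loss(lr, steps=64, d=8, seed=0):
    """Inner loop: online TTT-Linear updates over a token stream, scoring each
    token *before* updating on it. Returns the average self-supervised loss."""
    rng = np.random.default_rng(seed)
    W = np.zeros((d, d))
    total = 0.0
    for _ in range(steps):
        x = rng.standard_normal(d)
        total += 0.5 * np.sum((W @ x - x) ** 2)   # predict first...
        W = W - lr * np.outer(W @ x - x, x)       # ...then take the gradient step
    return total / steps

# Outer loop stand-in: pick the inner learning rate that makes the inner
# updates most useful on average.
candidates = [0.0, 0.01, 0.05]
best_lr = min(candidates, key=inner_ttt_loss)
print(best_lr)
```

A nonzero learning rate wins because any amount of adaptation beats a frozen state, which is the meta-learning point in miniature: the outer objective judges learning rules by how well the inner learner performs.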
Common Confusions
TTT is not fine-tuning at test time
Fine-tuning updates the entire model on a labeled dataset. TTT updates a single layer's weights using a self-supervised loss on the current input, with no labels. The update is local (one layer), fast (one gradient step per token), and unsupervised. It is closer to online learning than to fine-tuning.
TTT on evaluation data or already-seen tokens is data leakage
If you apply TTT during evaluation where the self-supervised loss uses tokens the model will be scored on, you are leaking test information into the model's parameters. The self-supervised update must use only tokens that precede the prediction target in the sequence. For causal language modeling, this means the TTT update at position $t$ can only use tokens $x_1, \dots, x_{t-1}$, not $x_t$ itself. Violating this produces artificially low perplexity numbers that do not reflect real-world performance. This is the same principle as train-test split hygiene: the model cannot learn from the data it is being evaluated on.
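The effect of this bug is easy to reproduce. In the illustrative sketch below (the helper `ttt_eval` and all dimensions are made up), the only difference between the two runs is whether the state is updated on $x_t$ before or after $x_t$ is scored:

```python
import numpy as np

def ttt_eval(tokens, lr=0.05, leak=False):
    """Average TTT-Linear loss over a token stream. With leak=True the state is
    updated on x_t *before* x_t is scored: the data-leakage bug described above."""
    W = np.zeros((tokens.shape[1],) * 2)
    total = 0.0
    for x in tokens:
        if leak:
            W = W - lr * np.outer(W @ x - x, x)   # WRONG: W has already seen the target
        total += 0.5 * np.sum((W @ x - x) ** 2)
        if not leak:
            W = W - lr * np.outer(W @ x - x, x)   # correct: update only after scoring
    return total / len(tokens)

rng = np.random.default_rng(2)
tokens = rng.standard_normal((64, 8))
print(ttt_eval(tokens))             # honest score
print(ttt_eval(tokens, leak=True))  # artificially low: test info leaked into W
```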
The weight matrix state is not the same as attention KV cache
Both grow in information content, but through different mechanisms. The KV cache stores exact key-value pairs and grows linearly with context length. The TTT weight state has fixed size ($O(d^2)$) but stores information through gradient-based compression, which can be lossy. Very long contexts may lose early information if later gradients overwrite it (catastrophic forgetting within a sequence).
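This within-sequence forgetting can be demonstrated directly. The sketch below (illustrative; `update`, `loss`, and the dimensions are made up) first lets the state memorize an early token, then hammers it with a later one and re-checks the early reconstruction:

```python
import numpy as np

def update(W, x, steps=100, lr=0.05):
    """Repeated TTT-Linear gradient steps on a single input (illustrative)."""
    for _ in range(steps):
        W = W - lr * np.outer(W @ x - x, x)
    return W

def loss(W, x):
    return 0.5 * np.sum((W @ x - x) ** 2)

rng = np.random.default_rng(3)
d = 8
x_early = rng.standard_normal(d)
x_late = rng.standard_normal(d)

W = update(np.zeros((d, d)), x_early)   # the state memorizes the early token
early_before = loss(W, x_early)
W = update(W, x_late)                   # later gradients overwrite part of the state
early_after = loss(W, x_early)
print(early_before, early_after)        # reconstruction of the early token degrades
```

Because the two random tokens are not orthogonal, the later gradient steps disturb the direction the early token lives in, so its reconstruction loss rises; a KV cache would have kept the early token intact.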
Exercises
Problem
A TTT-Linear layer has $f(x; W) = Wx$ with self-supervised loss $\ell(W; x) = \frac{1}{2}\|Wx - x\|^2$. Compute the gradient $\nabla_W \ell$ and describe what the weight update is doing geometrically.
Problem
Compare the information capacity of a Mamba state ($O(dn)$ numbers, with $n \ll d$) with a TTT weight state ($O(d^2)$ numbers, since the inner model maintains $W \in \mathbb{R}^{d \times d}$). Which can store more information, and under what assumptions?
References
Canonical:
- Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" (ICML 2024). The TTT paper.
- Sun et al., "TTT-MLP and TTT-Linear: Practical Implementations" (2024)
Current:
- Irie et al., "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention" (ICML 2022). Related dual-form perspective.
- Zhang et al., Dive into Deep Learning (2023), Chapters 14-17
- Murphy, Probabilistic Machine Learning: Advanced Topics (2023), Chapters 15-25
Next Topics
- Online learning and bandits: the theoretical framework for learning from sequential data
- Meta-learning: learning the learning rule itself
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Stochastic Gradient Descent Convergence (Layer 2)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rⁿ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Recurrent Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)