🔗 Source: arXiv

End-to-End Test-Time Training for Long Context

Mechanism: Updates the model's weights sequentially during inference by taking gradient steps on the next-token prediction loss over the context. The initial weights are meta-learned at training time so that they are a good starting point for these test-time updates.
Nuance: End-to-end across both training and inference. Outer-loop meta-gradients, taken through the test-time updates, directly optimize the final test loss rather than relying, as prior TTT methods do, on heuristic dynamic evaluation or fixed update rules.
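To make the two loops concrete, here is a minimal sketch, assuming a scalar linear predictor in place of a language model. It is not the paper's architecture or code. All names (`make_sequence`, `ttt_pass`, `meta_train`) and all hyperparameters are hypothetical. The inner loop predicts each target and then takes one gradient step on it. The outer loop updates the initial weight using the exact derivative of the sequence loss through every inner update.

```python
import random

def make_sequence(a, length, rng):
    """Toy 'context': targets follow y = a * x with a context-specific a."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    return [(x, a * x) for x in xs]

def ttt_pass(w0, seq, lr):
    """Inner loop: predict each target, then take one gradient step on it.

    Returns the mean prediction loss over the sequence and its exact
    derivative with respect to the initial weight w0 (the meta-gradient),
    accumulated by differentiating through every weight update.
    """
    w, dw_dw0 = w0, 1.0            # current weight and its sensitivity to w0
    total, dtotal = 0.0, 0.0
    for x, y in seq:
        err = w * x - y            # prediction is made before the update
        total += err * err
        dtotal += 2.0 * err * x * dw_dw0
        w -= lr * 2.0 * err * x    # test-time gradient step on this sample
        dw_dw0 *= 1.0 - 2.0 * lr * x * x   # chain rule through the step
    n = len(seq)
    return total / n, dtotal / n

def meta_train(steps=200, inner_lr=0.5, outer_lr=1.0, seed=0):
    """Outer loop: learn an initial weight that suits test-time updates."""
    rng = random.Random(seed)
    w0 = 0.0
    for _ in range(steps):
        a = rng.uniform(1.5, 2.5)              # each context has its own rule
        seq = make_sequence(a, 16, rng)
        _, meta_grad = ttt_pass(w0, seq, inner_lr)
        w0 -= outer_lr * meta_grad
    return w0
```

In this toy setting, `meta_train()` should settle on an initial weight inside the range of coefficients it was trained on, and `ttt_pass` started from that weight should give a lower loss on a new context than the same pass started from zero. The point of the sketch is only that the initialization is trained against the loss obtained after test-time updates, not against the loss of a frozen model.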

Matches the context-length scaling of a full-attention Transformer up to 128K tokens while keeping per-token decode latency constant, i.e. O(1) in context length.
Achieves a 2.7× inference speedup over standard full attention on H100 hardware without perplexity degradation.
Demonstrates that long-context modeling can be framed as continual learning rather than architectural redesign.
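The constant-latency claim can be illustrated with a rough counting model. This is a sketch under stated assumptions, not a measurement and not the paper's benchmark. The function names and the parameters `layers`, `kv_dim`, and `state_size` are hypothetical. It counts only how many stored elements are read to decode one token, and it ignores the cost of the gradient steps themselves.

```python
def full_attention_reads(context_len, layers, kv_dim):
    """Key and value elements read to decode one token from a full KV cache.

    Every layer attends over all cached tokens, so the count grows
    linearly with context length.
    """
    return 2 * layers * context_len * kv_dim

def fixed_state_reads(layers, state_size):
    """Elements read per decoded token when the context is held in a
    fixed-size state (here, weights updated at test time).

    The count does not depend on how many tokens have been seen.
    """
    return layers * state_size
```

Under this model, going from an 8K to a 128K context multiplies the full-attention count by 16, while the fixed-state count does not change. That is the sense in which decode latency is O(1) in context length.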

Inference-time gradient steps increase memory bandwidth pressure and require careful per-context learning rate scheduling.
Relies on next-token prediction loss at test time, which may degrade under distribution shifts or non-stationary contexts.
Evaluated primarily on standard language modeling; instruction tuning, multi-domain generalization, and real-world deployment remain unexplored.