End-to-End Test-Time Training
🔗 Source: arXiv
End-to-End Test-Time Training for Long Context
🚀 Technical Novelty
- Mechanism: Updates model weights sequentially during inference, using next-token prediction gradients computed on the context itself. The initial weights are meta-learned at training time so that they are well suited to this test-time learning.
- Nuance: The procedure is differentiable end to end across both training and inference. Outer-loop meta-gradients directly optimize the final test loss, whereas prior TTT methods rely on heuristic dynamic evaluation or fixed update rules.
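The two loops described above can be illustrated with a deliberately tiny sketch. It is not the paper's implementation: the "model" is a single scalar weight `w` that predicts the next value of a sequence as `w * x`, and every name (`ttt_forward`, `meta_train`, `make_seq`) and hyperparameter here is a hypothetical choice made for illustration. The inner loop takes one gradient step per observed value, and the outer loop differentiates the accumulated next-value loss through those steps back to the initial weight `w0`.

```python
def ttt_forward(w0, seq, inner_lr):
    """Inner loop: predict each next value, then take one gradient step on it.

    Returns (total next-value loss, exact d(total loss)/d(w0)).
    Toy scalar model, assumed for illustration only.
    """
    w, dw_dw0 = w0, 1.0
    total_loss, meta_grad = 0.0, 0.0
    for x, y in zip(seq[:-1], seq[1:]):
        err = w * x - y                  # predict with the weights held so far
        total_loss += err ** 2
        grad_w = 2.0 * err * x           # gradient of this step's loss w.r.t. w
        meta_grad += grad_w * dw_dw0     # chain rule back to the initial weight
        w -= inner_lr * grad_w           # test-time weight update
        dw_dw0 *= 1.0 - 2.0 * inner_lr * x * x   # d(new w)/d(old w)
    return total_loss, meta_grad


def make_seq(a, length=8):
    """Geometric sequence 1, a, a^2, ... whose ideal predictor is w = a."""
    return [a ** t for t in range(length)]


def meta_train(train_seqs, w0=0.0, inner_lr=0.05, outer_lr=0.01, steps=200):
    """Outer loop: gradient descent on w0 using the meta-gradient."""
    for _ in range(steps):
        g = sum(ttt_forward(w0, s, inner_lr)[1] for s in train_seqs)
        w0 -= outer_lr * g / len(train_seqs)
    return w0


train_seqs = [make_seq(a) for a in (0.8, 0.9, 1.0)]
w0_meta = meta_train(train_seqs)
```

Comparing `ttt_forward(w0_meta, s, 0.05)[0]` against `ttt_forward(0.0, s, 0.05)[0]` for a sequence `s` shows the point of the outer loop: the initialization is tuned for the loss obtained while the inner updates are running, not for the loss of a frozen model.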
💡 Yield
- Matches full-attention Transformer scaling curves up to 128K context while keeping per-token decode latency constant, i.e. O(1) in context length.
- Achieves a 2.7× inference speedup over standard full attention at 128K context on H100 hardware, without perplexity degradation.
- Demonstrates that long-context modeling can be framed as continual learning rather than architectural redesign.
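To see why a fixed-size state gives constant decode cost while full attention does not, here is a back-of-the-envelope cost model. It is a rough sketch under stated assumptions, not the paper's measurements: it counts only the values read per decoded token, ignores compute, test-time gradient steps, and any attention the actual architecture retains, and the layer count, model width, and state size are hypothetical.

```python
def kv_cache_reads_per_token(context_len, n_layers, d_model):
    """Full attention: read keys and values for every past token in every layer."""
    return 2 * n_layers * context_len * d_model


def fixed_state_reads_per_token(n_layers, state_size):
    """Fixed-size state (e.g. weights updated at test time): independent of context."""
    return n_layers * state_size


# Hypothetical model dimensions, chosen only for illustration.
N_LAYERS, D_MODEL = 24, 2048
STATE_SIZE = 4 * D_MODEL * D_MODEL   # assumed per-layer state size

for ctx in (8_192, 32_768, 131_072):
    kv = kv_cache_reads_per_token(ctx, N_LAYERS, D_MODEL)
    fixed = fixed_state_reads_per_token(N_LAYERS, STATE_SIZE)
    print(f"context={ctx:>7}  kv_reads={kv:.3e}  fixed_reads={fixed:.3e}")
```

In this model the full-attention column grows linearly with context length while the fixed-state column does not change, which is the qualitative behavior behind the constant-latency claim.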
⚠️ Limitations
- Gradient steps at inference time increase memory bandwidth pressure and require careful per-context learning rate scheduling.
- Relies on next-token prediction loss at test time, which may degrade under distribution shifts or non-stationary contexts.
- Evaluated primarily on standard language modeling; instruction-tuning, multi-domain generalization, and real-world deployment remain unexplored.