🔗 Source: arXiv

Titans: Learning to Memorize at Test Time

🚀 Technical Novelty

  • Mechanism: A deep neural long-term memory module that stores historical context in its own parameters at test time. Each update is driven by a gradient-based surprise signal (the gradient of the memory's loss on the incoming token), combined with momentum (past surprise) and weight decay (forgetting).
  • Nuance: Unlike standard in-context learning (which relies on the prompt alone with frozen weights), linear RNNs (which compress history into a fixed-size state), or Transformers (which cache history within a limited attention window), Titans explicitly updates the memory's weights during inference to memorize “surprising” events, decoupling short-term attention from long-term storage to improve length extrapolation.
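
The update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a linear memory matrix `W` instead of a deep MLP, a squared-error loss between `k @ W` and `v`, and fixed scalar values for the learning rate, momentum, and decay. The names `memory_step`, `lr`, `momentum`, and `decay` are hypothetical.

```python
import numpy as np

def memory_step(W, S, k, v, lr=0.1, momentum=0.9, decay=0.01):
    """One test-time update of a linear memory W (assumed simplification)."""
    err = k @ W - v                   # recall error for this token
    grad = 2.0 * np.outer(k, err)     # gradient of ||k W - v||^2 w.r.t. W (momentary surprise)
    S = momentum * S - lr * grad      # momentum carries past surprise forward
    W = (1.0 - decay) * W + S         # weight decay acts as forgetting
    return W, S

rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))                  # memory weights
S = np.zeros_like(W)                  # accumulated surprise
k = rng.normal(size=d)
k /= np.linalg.norm(k)                # unit-norm key keeps the toy update stable
v = rng.normal(size=d)

loss_before = float(np.sum((k @ W - v) ** 2))
for _ in range(200):                  # repeated exposure to the same key/value pair
    W, S = memory_step(W, S, k, v)
loss_after = float(np.sum((k @ W - v) ** 2))
print(loss_before, loss_after)
```

Repeating the same key/value pair drives the recall error close to zero. A small residual remains because the decay term keeps pulling the weights toward zero.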

💡 Yield

  • Reported to outperform Transformers and modern linear recurrent models (e.g., Mamba) on language modeling and commonsense reasoning, with strong results against task-specific baselines on time series forecasting and genomics (e.g., HyenaDNA).
  • Scales to context windows of more than 2M tokens, with higher needle-in-a-haystack accuracy than the baselines, while keeping training fast and parallelizable via tensorized mini-batch gradient descent.
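
The idea behind tensorized mini-batch gradient descent can be illustrated on a toy case. The sketch below assumes a linear memory, a squared-error loss, and that every gradient inside a chunk is taken at the chunk's starting weights `W0`. Momentum and decay are left out for brevity. Under those assumptions the per-token loop collapses into a few matrix products. The function names and the per-token step sizes `theta` are hypothetical.

```python
import numpy as np

def chunk_grad_loop(W0, K, V, theta):
    """Weighted sum of per-token gradients, each evaluated at W0."""
    G = np.zeros_like(W0)
    for k, v, th in zip(K, V, theta):
        G += th * 2.0 * np.outer(k, k @ W0 - v)
    return G

def chunk_grad_matmul(W0, K, V, theta):
    """Same quantity computed for the whole chunk with matrix products."""
    E = K @ W0 - V                            # (b, d) recall errors for all tokens
    return 2.0 * K.T @ (theta[:, None] * E)   # equals 2 * K^T diag(theta) E

rng = np.random.default_rng(0)
b, d = 16, 8                                  # chunk size, feature dimension
W0 = rng.normal(size=(d, d))
K = rng.normal(size=(b, d))
V = rng.normal(size=(b, d))
theta = rng.uniform(0.01, 0.1, size=b)

G_loop = chunk_grad_loop(W0, K, V, theta)
G_mm = chunk_grad_matmul(W0, K, V, theta)
print(np.allclose(G_loop, G_mm))
```

Because the chunk-level form uses only matrix multiplications, it maps well onto accelerator hardware. The sequential dependence remains only across chunks.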

⚠️ Limitations

  • Training throughput is slightly lower than that of models with highly optimized kernels, such as Mamba2, owing to the computational overhead of deep memory updates and convolutional operations.
  • The architectural variants (Memory as Context, Memory as Gate, Memory as Layer) trade expressiveness against speed, and effective memory management across tasks depends on careful tuning of the decay and momentum terms.