🔗 Source: arXiv

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

🚀 Technical Novelty

  • Mechanism: Shows that linear attention and DeltaNet become exactly equivalent when their updates are preconditioned by the inverse key Gram matrix, then replaces the full inverse with a practical diagonal approximation and pairs it with efficient chunkwise parallel algorithms for the online least-squares updates.
  • Nuance: Prior delta-rule models apply uniform decay or ignore loss curvature during online optimization; this method dynamically scales the write key via learned diagonal preconditioners, capturing second-order geometry while preserving sub-quadratic compute and stable parallel training.
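
To make the mechanism concrete, here is a minimal NumPy sketch of a single delta-rule write whose write key is rescaled by a diagonal preconditioner. It is an illustration under stated assumptions, not the paper's implementation: the function name `preconditioned_delta_step`, the step size `beta`, and the vector `p` (a stand-in for the learned diagonal preconditioner) are all hypothetical, and the chunkwise parallel form is not shown.

```python
import numpy as np

def preconditioned_delta_step(S, k, v, p, beta=1.0):
    """One delta-rule write with a diagonally rescaled write key.

    S    : (d_v, d_k) memory state mapping keys to values
    k    : (d_k,) key for the current token
    v    : (d_v,) value for the current token
    p    : (d_k,) positive diagonal preconditioner (hypothetical stand-in
           for a learned one; p = ones recovers the plain delta rule)
    beta : scalar step size (assumed name)
    """
    write_key = p * k            # per-dimension rescaling of the write key
    error = v - S @ k            # what the state currently gets wrong on k
    return S + beta * np.outer(error, write_key)

# Example: one write on random data.
rng = np.random.default_rng(0)
S0 = np.zeros((3, 4))
k0 = rng.normal(size=4)
v0 = rng.normal(size=3)
p0 = np.full(4, 0.5)
S1 = preconditioned_delta_step(S0, k0, v0, p0, beta=0.5)
```

In this sketch the residual on the current key shrinks by the factor `1 - beta * (p * k) @ k`, so the write contracts the error only when `beta * (p * k) @ k` stays between 0 and 2.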

💡 Yield

  • Theoretical unification: under exact inverse Gram preconditioning, linear attention and DeltaNet are mathematically equivalent, and a stable diagonal approximation admits efficient chunkwise parallel forms.
  • Consistent perplexity and zero-shot accuracy gains at the 340M and 1B parameter scales on commonsense reasoning and in-context retrieval (S-NIAH) benchmarks, with improved write-eigenvalue expressivity for long-context recall.
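
One standard way to see why such an equivalence can hold is recursive least squares: a delta-rule write preconditioned by the inverse of the running (regularized) key Gram matrix reproduces the linear-attention state multiplied by that same inverse. The NumPy check below illustrates this identity numerically. It is a sketch, not the paper's derivation: the ridge term `lam`, the zero initial state, and the function names are assumptions made for the example.

```python
import numpy as np

def linear_attention_state(K, V):
    """Unnormalized linear-attention state: sum_i v_i k_i^T, shape (d_v, d_k)."""
    return V.T @ K

def rls_delta_state(K, V, lam=1.0):
    """Delta-rule writes preconditioned by the inverse running key Gram matrix.

    K : (T, d_k) keys, V : (T, d_v) values.
    lam is an assumed ridge term that keeps the Gram matrix invertible.
    Returns the final state S (d_v, d_k) and the final Gram matrix G (d_k, d_k).
    """
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_v, d_k))
    G = lam * np.eye(d_k)
    for k, v in zip(K, V):
        G += np.outer(k, k)                  # running Gram: lam*I + sum k k^T
        write_key = np.linalg.solve(G, k)    # G^{-1} k (G is symmetric)
        S += np.outer(v - S @ k, write_key)  # preconditioned delta-rule write
    return S, G

# Compare against the linear-attention state times the inverse Gram matrix.
rng = np.random.default_rng(0)
K = rng.normal(size=(32, 8))
V = rng.normal(size=(32, 4))
S_delta, G = rls_delta_state(K, V)
S_linear = np.linalg.solve(G, linear_attention_state(K, V).T).T  # A @ G^{-1}
states_match = np.allclose(S_delta, S_linear, atol=1e-8)
```

Replacing `G` with only its diagonal in `np.linalg.solve` gives a per-dimension rescaling of the write key, at the cost of breaking the exact match checked here.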

⚠️ Limitations

  • Diagonal approximation discards cross-dimensional curvature information present in the full key Gram matrix, potentially limiting optimization fidelity in high-dimensional feature spaces.
  • Evaluations are restricted to the 340M and 1B parameter scales and to synthetic recall tasks; scalability to larger models or longer contexts remains to be validated.
  • Cross-architectural comparisons (e.g., vs. KDA) should be read with caution, since configurations were fixed rather than fully tuned for each baseline.