🔗 Source: arXiv

GATED DELTA NETWORKS : IMPROVING MAMBA 2 WITH DELTA RULE

🚀 Technical Novelty

  • Mechanism: Introduces a gated delta rule that dynamically balances uniform state decay for rapid memory erasure with selective key-value replacement for targeted updates, implemented via an extended chunkwise parallel algorithm optimized for tensor cores.
  • Nuance: Unlike Mamba2’s uniform scalar decay or DeltaNet’s sequential single-pair updates, this mechanism enables flexible, hardware-optimized memory clearance and precise content modification simultaneously without sacrificing parallel training throughput.

💡 Yield

  • Consistently surpasses Mamba2 and DeltaNet across language modeling, in-context retrieval, length extrapolation, and long-context understanding benchmarks (e.g., +15% gain on TRec).
  • Hybrid architectures interleaving Gated DeltaNet with sliding window attention achieve superior training throughput and task accuracy on LongBench while maintaining linear-time complexity.

⚠️ Limitations

  • The underlying delta rule faces theoretical expressiveness limits for complex state-tracking beyond TC0 complexity, requiring hybrid designs or extended formalisms for peak performance.
  • Standalone linear layers may still underperform full attention on highly complex multi-document reasoning tasks without careful architectural tuning or increased depth.