🔗 Source: arXiv

GATED DELTA NETWORKS : IMPROVING MAMBA 2 WITH DELTA RULE

Mechanism: Introduces a gated delta rule that dynamically balances uniform state decay for rapid memory erasure with selective key-value replacement for targeted updates, implemented via an extended chunkwise parallel algorithm optimized for tensor cores.
Nuance: Unlike Mamba2’s uniform scalar decay or DeltaNet’s sequential single-pair updates, this mechanism enables flexible, hardware-optimized memory clearance and precise content modification simultaneously without sacrificing parallel training throughput.

Consistently surpasses Mamba2 and DeltaNet across language modeling, in-context retrieval, length extrapolation, and long-context understanding benchmarks (e.g., +15% gain on TRec).
Hybrid architectures interleaving Gated DeltaNet with sliding window attention achieve superior training throughput and task accuracy on LongBench while maintaining linear-time complexity.

The underlying delta rule faces theoretical expressiveness limits for complex state-tracking beyond TC0 complexity, requiring hybrid designs or extended formalisms for peak performance.
Standalone linear layers may still underperform full attention on highly complex multi-document reasoning tasks without careful architectural tuning or increased depth.