🔗 Source: arXiv

GATED DELTA NETWORKS : IMPROVING MAMBA 2 WITH DELTA RULE

Mechanism: Introduces a gated delta rule that dynamically balances rapid memory erasure (via gating) with targeted key-value pair updates (via the delta rule), implemented through an extended chunkwise parallel algorithm optimized for tensor cores.
Nuance: Unlike Mamba2’s uniform scalar decay or DeltaNet’s sequential single-pair updates, this mechanism enables flexible, content-aware memory management while preserving linear-time training and hardware efficiency.

Consistently surpasses Mamba2 and DeltaNet on language modeling, in-context retrieval, length extrapolation, and long-context understanding benchmarks (e.g., +15% accuracy on Multi-Doc QA).
Hybrid variants (Gated DeltaNet-H1/H2) interleaved with sliding window attention or Mamba2 layers achieve superior training throughput across all sequence lengths without sacrificing task performance.

The underlying delta rule faces theoretical expressiveness limits compared to full softmax attention, requiring careful parameterization (e.g., negative eigenvalues) to unlock state-tracking capabilities.
Hybrid architectures introduce architectural complexity and require empirical tuning of layer composition to balance throughput gains with memory overhead.