Hardware-Efficient Gated Delta Networks
🔗 Source: arXiv
Gated Delta Networks: Improving Mamba2 with Delta Rule
🚀 Technical Novelty
- Mechanism: Introduces a gated delta rule that dynamically balances rapid memory erasure (via gating) with targeted key-value pair updates (via the delta rule), implemented through an extended chunkwise parallel algorithm optimized for tensor cores.
- Nuance: Mamba2's scalar gate decays every stored association uniformly, and DeltaNet rewrites one key-value pair per step with no way to clear memory quickly. The gated delta rule combines both controls, giving flexible, content-aware memory management while preserving linear-time training and hardware efficiency.
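
The interplay between gating and the delta rule can be illustrated with a small recurrent sketch. It assumes the state update takes the form S_t = α_t · S_{t-1} (I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with unit-norm keys, where α_t is the decay gate and β_t the write strength. The function names (`gated_delta_step`, `read`) and the toy dimensions are hypothetical, and this is the naive step-by-step recurrence, not the chunkwise parallel algorithm.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of the assumed gated delta rule.

    S: (d_v, d_k) memory state, k: (d_k,) unit-norm key, v: (d_v,) value.
    alpha in [0, 1] decays the whole state; beta in [0, 1] is the write strength.
    """
    d_k = k.shape[0]
    # The delta term (I - beta k k^T) removes what is stored under k;
    # the gate alpha uniformly scales everything that remains.
    transition = alpha * (np.eye(d_k) - beta * np.outer(k, k))
    return S @ transition + beta * np.outer(v, k)

def read(S, q):
    """Query the memory with q."""
    return S @ q

k1 = np.array([1.0, 0.0, 0.0])
k2 = np.array([0.0, 1.0, 0.0])

# Targeted update: writing twice under the same key replaces the old value.
S = np.zeros((2, 3))
S = gated_delta_step(S, k1, np.array([1.0, 2.0]), alpha=1.0, beta=1.0)
S = gated_delta_step(S, k1, np.array([5.0, 6.0]), alpha=1.0, beta=1.0)
overwritten = read(S, k1)

# Rapid erasure: alpha = 0 clears all earlier associations in one step.
S = gated_delta_step(S, k2, np.array([7.0, 8.0]), alpha=0.0, beta=1.0)
after_erase = read(S, k1)
fresh = read(S, k2)

print(overwritten)  # [5. 6.]
print(after_erase)  # [0. 0.]
print(fresh)        # [7. 8.]
```

With α = 1 the step reduces to a plain delta-rule update, and with the delta term dropped it reduces to uniform scalar decay, which is why the combined rule can both overwrite a single association and wipe the state.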
💡 Yield
- Consistently surpasses Mamba2 and DeltaNet on language modeling, in-context retrieval, length extrapolation, and long-context understanding benchmarks (e.g., +15% accuracy on Multi-Doc QA).
- Hybrid variants achieve superior training throughput across all sequence lengths without sacrificing task performance: Gated DeltaNet-H1 combines Gated DeltaNet layers with sliding window attention, and Gated DeltaNet-H2 additionally includes Mamba2 layers.
⚠️ Limitations
- The underlying delta rule faces theoretical expressiveness limits compared to full softmax attention, requiring careful parameterization (e.g., negative eigenvalues) to unlock state-tracking capabilities.
- Hybrid variants add design complexity and require empirical tuning of the layer mix to balance throughput gains against memory overhead.
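
The negative-eigenvalue point in the first limitation can be made concrete with a short numerical check. The sketch assumes the per-step transition matrix is I − β k kᵀ with a unit-norm key, and that the parameterization in question lets β range over (0, 2) instead of (0, 1); both are assumptions for illustration, and `transition_eigenvalues` is a hypothetical helper.

```python
import numpy as np

def transition_eigenvalues(k, beta):
    """Eigenvalues (ascending) of I - beta * k k^T after normalizing k."""
    k = k / np.linalg.norm(k)
    A = np.eye(k.shape[0]) - beta * np.outer(k, k)
    # A is symmetric, so eigvalsh applies and returns sorted real eigenvalues.
    return np.linalg.eigvalsh(A)

k = np.array([3.0, 4.0, 0.0])

# beta in (0, 1): the eigenvalue along k is 1 - beta, which stays in (0, 1).
eig_standard = transition_eigenvalues(k, beta=0.5)

# beta in (1, 2): the eigenvalue along k becomes negative (a sign flip along k).
eig_extended = transition_eigenvalues(k, beta=1.5)

print(eig_standard)  # [0.5 1.  1. ]
print(eig_extended)  # [-0.5  1.   1. ]
```

Along the key direction the eigenvalue is 1 − β and every orthogonal direction keeps eigenvalue 1, so restricting β to (0, 1) rules out the sign-flipping transitions that state-tracking tasks such as parity rely on.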