Gist Sparse Attention

🔗 Source: arXiv

Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

🚀 Technical Novelty

Mechanism: Interleaved learnable gist tokens compress raw contexts into dense summaries, serving as differentiable routing signals to selectively unfold and attend only to the most relevant raw token chunks.
Nuance: Bridges compression and training-time sparsity end-to-end within standard Transformers, avoiding the architectural modifications of NSA, external indexers of DSA, and non-differentiable pooling gates of MoBA.
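
The route-then-unfold idea above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `gist_sparse_attend`, `gist_keys`, and `top_k` are hypothetical names, each gist is reduced to a single given key vector rather than a learned token, the hard top-k selection is not differentiable, and attention runs only over the unfolded raw tokens.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gist_sparse_attend(query, chunks, gist_keys, top_k=1):
    """chunks: list of chunks, each a list of (key, value) pairs for raw tokens.
    gist_keys: one key vector per chunk (assumed stand-in for a learned gist token).
    Returns (selected_chunk_indices, output_vector)."""
    # 1. Route: score the query against each chunk's gist key.
    gist_scores = [dot(query, g) for g in gist_keys]
    ranked = sorted(range(len(chunks)), key=lambda i: gist_scores[i], reverse=True)
    selected = sorted(ranked[:top_k])
    # 2. Unfold: gather raw key/value pairs from the selected chunks only.
    pairs = [kv for i in selected for kv in chunks[i]]
    # 3. Attend: standard softmax attention over the unfolded tokens.
    weights = softmax([dot(query, k) for k, _ in pairs])
    dim = len(pairs[0][1])
    output = [sum(w * v[d] for w, (_, v) in zip(weights, pairs)) for d in range(dim)]
    return selected, output

# Toy example: two chunks whose keys point along different axes.
chunks = [
    [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.1], [1.0, 0.0])],
    [([0.0, 1.0], [0.0, 1.0]), ([0.1, 0.9], [0.0, 1.0])],
]
gist_keys = [[1.0, 0.0], [0.0, 1.0]]
selected, output = gist_sparse_attend([0.0, 2.0], chunks, gist_keys, top_k=1)
```

In the toy example the query aligns with the second gist key, so only the second chunk is unfolded and the first chunk's raw tokens are never touched.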

💡 Yield

Achieves logarithmic per-step decoding complexity, O(log n) in context length n, via recursive gist-of-gist construction.
Consistently outperforms compression baselines and inference-time sparse attention on LongBench/RAG benchmarks, delivering 8–12 point gains after fine-tuning under identical token budgets (8× to 32× compression).
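
The per-step cost claim above can be made concrete by counting how many entries one decoding step touches. The sketch below assumes a balanced hierarchy in which every gist summarizes `branch` entries of the level beneath it and each step unfolds `top_k` groups per level; the function name and both parameters are illustrative assumptions, not the paper's configuration.

```python
def attended_per_step(n, branch, top_k=1):
    """Count the entries one decoding step attends to in a balanced gist hierarchy.

    Assumptions: each gist summarizes `branch` entries of the level below,
    the top level holds at most `branch` gists, and the step unfolds
    `top_k` groups at every level beneath the top.
    """
    # Level sizes bottom-up: n raw tokens, then ceil(n / branch) gists, and so on.
    sizes = [n]
    while sizes[-1] > branch:
        sizes.append(-(-sizes[-1] // branch))   # ceiling division
    # All top-level entries, plus top_k * branch entries per level below the top.
    levels_below = len(sizes) - 1
    return sizes[-1] + levels_below * top_k * branch

# Context grows 8x, while the per-step count grows by one extra level of 8 entries.
small = attended_per_step(4096, branch=8)
large = attended_per_step(32768, branch=8)
```

Under these assumptions the count grows with the number of levels, roughly `branch * log_branch(n)`, rather than with n itself.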

⚠️ Limitations

Selective fine-tuning requires position-dependent sparse masks, which may need custom CUDA kernels to implement efficiently.
Hierarchical depth and chunking parameters require careful tuning to balance compression ratio against retrieval fidelity. The coarse-to-fine routing also assumes structured context dependencies rather than uniformly dense ones.
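
To see why the masks are position-dependent, the following sketch builds a boolean attention mask for a hypothetical layout in which every raw chunk is followed by one gist token. The layout, the `build_mask` name, and the `selected` mapping are all assumptions made for illustration; the paper's actual masking scheme may differ.

```python
def build_mask(num_chunks, chunk_len, selected):
    """Boolean attention mask for raw chunks that are each followed by one gist token.

    selected: dict mapping a query position to the set of earlier chunk ids
    that this query unfolds (hypothetical routing output).
    mask[q][k] is True when query position q may attend to key position k.
    """
    stride = chunk_len + 1                        # raw tokens plus one trailing gist
    total = num_chunks * stride
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_chunk = q // stride
        for k in range(q + 1):                    # causal: keys at or before q
            k_chunk = k // stride
            is_gist = (k % stride == chunk_len)
            if k_chunk == q_chunk:                # local chunk: always visible
                mask[q][k] = True
            elif is_gist:                         # earlier gists: always visible
                mask[q][k] = True
            elif k_chunk in selected.get(q, ()):  # raw tokens of an unfolded chunk
                mask[q][k] = True
    return mask

# Three chunks of two raw tokens each; only query position 7 unfolds chunk 0.
mask = build_mask(num_chunks=3, chunk_len=2, selected={7: {0}})
```

Rows 6 and 7 sit in the same chunk yet unmask different key blocks, so the sparsity pattern cannot be expressed as one fixed block layout shared by all queries, which is the irregularity that makes an efficient kernel nontrivial.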