🔗 Source: arXiv

Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs

🚀 Technical Novelty

  • Mechanism: A hierarchical two-stage top-p framework that performs coarse cluster-level attention mass estimation followed by adaptive token-level refinement, paired with GPU-optimized kernels to minimize selection overhead.
  • Nuance: Replaces rigid fixed-budget (top-k) and single-stage top-p selection with dynamic, step-wise budget allocation that preserves a target attention mass while avoiding the linear selection overhead and target-mass violations of prior methods.
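
The two-stage selection described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how clusters are formed or how cluster mass is estimated, so the sketch assumes precomputed cluster assignments and approximates each cluster's mass as its token count times the exponentiated centroid score. The names `top_p_indices`, `double_p_select`, and `cluster_ids` are hypothetical; only the thresholds `p1` and `p2` come from the summary.

```python
import numpy as np

def top_p_indices(weights, p):
    """Return the smallest set of indices whose normalized weights sum to at least p."""
    probs = weights / weights.sum()
    order = np.argsort(-probs)                 # largest weight first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # first prefix reaching p
    return order[:min(cutoff, len(order))]

def double_p_select(q, keys, cluster_ids, p1=0.95, p2=0.95):
    """Hypothetical two-stage top-p token selection for a single query vector."""
    d = q.shape[0]
    n_clusters = int(cluster_ids.max()) + 1

    # Stage 1 (coarse): estimate attention mass per cluster from its centroid.
    # ASSUMPTION: mass ~ cluster size * exp(q . centroid / sqrt(d)).
    counts = np.bincount(cluster_ids, minlength=n_clusters)
    sums = np.zeros((n_clusters, d))
    np.add.at(sums, cluster_ids, keys)
    centroids = sums / np.maximum(counts, 1)[:, None]
    logits_c = centroids @ q / np.sqrt(d)
    est_mass = counts * np.exp(logits_c - logits_c.max())
    kept_clusters = top_p_indices(est_mass, p1)

    # Stage 2 (fine): exact scores only for tokens in the kept clusters,
    # then a second top-p pass over those candidates.
    cand = np.flatnonzero(np.isin(cluster_ids, kept_clusters))
    logits = keys[cand] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    return np.sort(cand[top_p_indices(w, p2)])

# Usage on synthetic data (random keys, random cluster assignment).
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 8))
q = rng.normal(size=8)
cluster_ids = rng.integers(0, 8, size=64)
selected = double_p_select(q, keys, cluster_ids)
```

The number of tokens returned varies with how concentrated the scores are, which is the point of a top-p budget: a peaked query keeps few tokens, a diffuse one keeps many.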

💡 Yield

  • Achieves a near-zero accuracy drop relative to full attention on the RULER and LongBench benchmarks at context lengths up to 128K.
  • Delivers up to 1.78× attention-level speedup and 1.26× end-to-end decoding speedup over state-of-the-art sparse baselines (Quest, RetroInfer).
  • Eliminates the target-mass violations inherent in fixed-budget methods across heterogeneous heads and layers.
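
To make the target-mass violation idea concrete, the sketch below measures how often a fixed top-k budget captures less than a target mass across heads of differing sharpness, and compares that with the per-head budget a top-p rule would need. Everything here is an assumption for illustration: the attention rows are synthetic, and `topk_mass`, `violation_rate`, and `top_p_budget` are hypothetical helper names, not quantities or code from the paper.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def topk_mass(probs, k):
    """Attention mass captured by the k largest weights in each row."""
    sorted_desc = np.sort(probs, axis=-1)[:, ::-1]
    return sorted_desc[:, :k].sum(axis=-1)

def violation_rate(probs, k, target_p):
    """Fraction of rows where a fixed budget k captures less than target_p."""
    return float(np.mean(topk_mass(probs, k) < target_p))

def top_p_budget(probs, target_p):
    """Per-row number of tokens needed to reach target_p."""
    sorted_desc = np.sort(probs, axis=-1)[:, ::-1]
    cum = np.cumsum(sorted_desc, axis=-1)
    need = (cum < target_p).sum(axis=-1) + 1
    return np.minimum(need, probs.shape[-1])

# Synthetic "heads": the same kind of logits at different scales, so some
# rows are sharply peaked and others nearly flat (ASSUMED, not measured).
rng = np.random.default_rng(0)
n_heads, n_tokens = 32, 256
scales = np.linspace(0.5, 8.0, n_heads)
probs = softmax(rng.normal(size=(n_heads, n_tokens)) * scales[:, None])

rate_fixed = violation_rate(probs, k=16, target_p=0.95)  # fixed budget
budgets = top_p_budget(probs, target_p=0.95)             # adaptive budget
```

On data like this, a single k is too small for the flat heads and wasteful for the peaked ones, whereas the top-p budget meets the target in every row by construction.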

⚠️ Limitations

  • Requires empirical tuning of hierarchical thresholds (p1, p2) to balance accuracy and latency per model/task.
  • Evaluated primarily on LLaMA3.1-8B and Qwen3-8B; generalization to other architectures or training paradigms is not explicitly validated.