query-oriented-sparse-attention
🔗 Source: arXiv
QUOKA: QUERY-ORIENTED KV SELECTION FOR EFFICIENT LLM PREFILL
🚀 Technical Novelty
- Mechanism: Training-free, query-oriented KV selection for chunked prefill. Cosine dissimilarity first identifies representative queries; cosine-similarity scoring against those queries then subselects the most relevant keys/values.
- Nuance: Unlike prior SOTA, it explicitly accounts for query geometry within multi-query prefill chunks, avoiding the homogeneous averaging that degrades accuracy. It remains fully compatible with standard dense kernels and has no custom hardware dependencies.
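The two-stage selection described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `select_kv`, `dense_attention`, `num_queries`, and `kv_budget` are hypothetical, and two details are guesses that this summary does not confirm: measuring dissimilarity against the chunk's mean query, and aggregating per-key scores with a max over the representative queries.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def select_kv(Q, K, V, num_queries, kv_budget):
    """Pick a subset of cached keys/values for one prefill chunk.

    Q: (m, d) queries of the current chunk.
    K, V: (n, d) cached keys and values.
    Returns the selected keys, values, and their sorted indices.
    """
    Qn = l2_normalize(Q)
    Kn = l2_normalize(K)

    # Stage 1 (assumption): representative queries are those most
    # dissimilar, in cosine terms, to the chunk's mean query.
    mean_q = l2_normalize(Q.mean(axis=0))
    dissimilarity = 1.0 - Qn @ mean_q              # shape (m,)
    rep_idx = np.argsort(-dissimilarity)[:num_queries]

    # Stage 2: score every key by cosine similarity to the
    # representative queries; max aggregation is an assumption.
    scores = Qn[rep_idx] @ Kn.T                    # shape (r, n)
    key_score = scores.max(axis=0)                 # shape (n,)

    # Keep the top-scoring keys, restored to original token order.
    keep = np.sort(np.argsort(-key_score)[:kv_budget])
    return K[keep], V[keep], keep


def dense_attention(Q, K, V):
    """Plain softmax attention: no custom sparse kernel is needed,
    because the selected K/V are gathered into dense arrays."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

A chunk would then call `select_kv(Q, K, V, num_queries, kv_budget)` and pass the returned keys and values to `dense_attention`. The sketch omits causal masking inside the chunk and multi-head batching.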
💡 Yield
- Near-baseline accuracy on LongBench, RULER, Needle-in-a-Haystack, and Math500 benchmarks while using 88% fewer KV pairs.
- Up to 3× reduction in time-to-first-token (TTFT) and up to 7× attention speedup across NVIDIA GPUs, Intel CPUs, and consumer hardware.
- Robust performance across diverse LLM families (Llama3, Qwen3, SmolLM, GPT-OSS) with gradual accuracy degradation under increasing sparsity.
⚠️ Limitations
- Accuracy degrades gradually as sparsity increases (that is, as the KV budget shrinks), though the drop stays within ~3% when fewer than 12% of tokens are used.
- Computation of the aggregated query-key matrix ($\bar{Q}K^\top$) could be further optimized; authors note potential for exploiting channel sparsity or learned projections.
- Primarily validated on decoder-only LLMs and chunked prefill settings; KV cache eviction integration is left for future work.