🔗 Source: arXiv

QUOKA: QUERY-ORIENTED KV SELECTION FOR EFFICIENT LLM PREFILL

Mechanism: Training-free query-oriented KV selection using cosine dissimilarity to identify representative queries, followed by cosine similarity scoring to subselect the most relevant keys/values for chunked prefill.
Nuance: Differs from prior SOTA by explicitly accounting for query geometry during multi-query prefill chunks, avoiding homogeneous averaging that degrades accuracy, and remaining fully compatible with standard dense kernels without custom hardware dependencies.
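
The two-stage selection described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `select_kv` and the parameters `num_queries` and `num_keys` are hypothetical. Two details are assumptions that the summary does not confirm: query dissimilarity is measured against the chunk's mean query, and per-key scores are aggregated across the representative queries with a max.

```python
import numpy as np


def _unit_rows(x, eps=1e-8):
    """Scale each row to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def select_kv(Q, K, V, num_queries, num_keys):
    """Pick a subset of keys/values for one prefill chunk (single head).

    Q: (n_q, d) queries of the current chunk
    K, V: (n_k, d) cached keys and values
    Returns the kept keys, kept values, and their indices in cache order.
    """
    Qn, Kn = _unit_rows(Q), _unit_rows(K)

    # Stage 1: representative queries via cosine dissimilarity.
    # ASSUMPTION: dissimilarity is taken relative to the chunk's mean query.
    mean_q = _unit_rows(Q.mean(axis=0, keepdims=True))
    dissim = 1.0 - (Qn @ mean_q.T).ravel()
    rep_idx = np.argsort(-dissim)[:num_queries]

    # Stage 2: score every key by cosine similarity to the representatives.
    sims = Qn[rep_idx] @ Kn.T  # shape (num_queries, n_k)
    # ASSUMPTION: max over representatives; other reductions are possible.
    scores = sims.max(axis=0)

    # Keep the top-scoring keys, restored to their original cache order.
    keep = np.sort(np.argsort(-scores)[:num_keys])
    return K[keep], V[keep], keep
```

Because the output is simply a smaller dense `K` and `V`, it can be passed to an ordinary dense attention kernel, which is the compatibility property the summary highlights. With an 88% reduction, `num_keys` would be roughly 12% of the cache length.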

Near-baseline accuracy on LongBench, RULER, Needle-in-a-Haystack, and Math500 benchmarks while using 88% fewer KV pairs.
Up to 3× reduction in time-to-first-token (TTFT) and up to 7× attention speedup across NVIDIA GPUs, Intel CPUs, and consumer hardware.
Robust performance across diverse LLM families (Llama3, Qwen3, SmolLM, GPT-OSS) with gradual accuracy degradation under increasing sparsity.

Accuracy degrades gradually as sparsity increases, that is, as the KV budget shrinks (though the drop stays within ~3% when fewer than 12% of tokens are used).
Computation of the aggregated query-key matrix ($\bar{Q}K^\top$) could be further optimized; the authors note the potential for exploiting channel sparsity or learned projections.
Primarily validated on decoder-only LLMs and chunked prefill settings; KV cache eviction integration is left for future work.