🔗 Source: arXiv

CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction

🚀 Technical Novelty

  • Mechanism: Computes each cached token's eviction error in closed form by combining its attention score (query-key alignment) with its value vector, quantifying that token's exact contribution to the final attention output.
  • Nuance: Unlike prior eviction methods that use attention scores alone as a proxy for importance, CAOTE explicitly accounts for value vectors. It acts as a model-agnostic meta-heuristic that plugs into existing score-based strategies (e.g., H2O, SnapKV) and yields consistent performance gains.
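
To make the mechanism concrete, here is a minimal sketch for a single query and a single attention head, assuming plain softmax attention. If the output is o = Σ a_i·v_i and token j is dropped, the remaining weights renormalize to a_i / (1 - a_j), so the output shifts by exactly a_j / (1 - a_j) · ||v_j - o||. The function names and toy numbers below are illustrative assumptions, not taken from the paper, and the paper's full scoring rule may differ in detail.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(weights, values):
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def eviction_errors(logits, values):
    """Closed-form L2 change in the attention output if each token alone were evicted.

    Assumes at least two tokens, so no single weight equals 1.
    """
    alpha = softmax(logits)
    out = attention_output(alpha, values)
    errors = []
    for a, v in zip(alpha, values):
        dist = math.sqrt(sum((vd - od) ** 2 for vd, od in zip(v, out)))
        errors.append(a / (1.0 - a) * dist)
    return errors

def brute_force_error(logits, values, j):
    """Reference check: recompute attention without token j and measure the change."""
    full = attention_output(softmax(logits), values)
    kept_logits = logits[:j] + logits[j + 1:]
    kept_values = values[:j] + values[j + 1:]
    reduced = attention_output(softmax(kept_logits), kept_values)
    return math.sqrt(sum((f - r) ** 2 for f, r in zip(full, reduced)))

# Toy data (hypothetical): token 2 has a low attention weight but an outlying value vector.
logits = [2.0, 1.0, 0.5, -1.0]
values = [[1.0, 0.0], [0.9, 0.1], [-2.0, 3.0], [0.0, 1.0]]
alpha = softmax(logits)
errors = eviction_errors(logits, values)
```

In this toy case token 2 receives less attention than token 1, yet evicting it would perturb the output more because its value vector lies far from the current output. A ranking based on attention scores alone would miss this.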

💡 Yield

  • Consistently improves accuracy across LongBench, Needle-in-Haystack retrieval, and perplexity benchmarks on LLaMA3 and Qwen2.5 families when integrated with state-of-the-art eviction baselines.
  • Enables block-wise prompt processing and dynamic KV cache budgeting, reducing memory and compute bottlenecks for long-context inference on resource-constrained hardware without model retraining or fine-tuning.
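
The block-wise processing under a fixed cache budget mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: `evict_to_budget`, `process_prompt_in_blocks`, and `score_fn` are hypothetical names, and `score_fn` stands in for whatever per-token importance score is used (an attention-based score, or one adjusted by eviction error). It does not reproduce the paper's actual procedure.

```python
def evict_to_budget(cache, scores, budget):
    """Keep the `budget` highest-scoring entries, preserving their original order."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep]

def process_prompt_in_blocks(tokens, block_size, budget, score_fn):
    """Append the prompt to the cache one block at a time, evicting after each block."""
    cache = []
    for start in range(0, len(tokens), block_size):
        cache.extend(tokens[start:start + block_size])
        # score_fn is a placeholder: it returns one importance score per cached entry.
        cache = evict_to_budget(cache, score_fn(cache), budget)
    return cache

# Toy run: integer "tokens" scored by their own value, so the largest ones survive.
kept = process_prompt_in_blocks(
    tokens=list(range(10)),
    block_size=4,
    budget=3,
    score_fn=lambda cache: [float(t) for t in cache],
)
```

With this structure the cache never holds more than `budget + block_size` entries at once, which is the property that bounds peak memory during prompt processing.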

⚠️ Limitations

  • Relies on a heuristic closed-form approximation rather than learned parameters; performance may vary across non-standard attention mechanisms or highly specialized architectures.
  • Primarily evaluated on decoder-only LLMs; generalization to multimodal, encoder-decoder, or hybrid models is not explicitly demonstrated.
  • The per-token computational overhead of calculating the CAOTE score during autoregressive generation must be carefully balanced against memory savings in ultra-low-latency deployment scenarios.