🔗 Source: arXiv

CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction

Mechanism: Computes each cached token’s eviction error in closed form by combining its attention score (query-key alignment) with its value vector, quantifying how much the final attention output would change if that token were removed.
Nuance: Unlike prior eviction methods that use attention scores alone as a proxy for importance, CAOTE explicitly models the value vectors' contribution. It acts as a model-agnostic meta-heuristic that can be layered on top of existing score-based strategies (e.g., H2O, SnapKV), with consistent gains reported over those baselines.
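To make the mechanism concrete, here is a minimal sketch for a single query and a single attention head, assuming exactly one token is evicted and the remaining softmax weights are renormalized. Under those assumptions, the change in the attention output o when token j is dropped works out to a_j / (1 - a_j) * ||v_j - o||, where a_j is the token's attention weight and v_j its value vector. The function names are hypothetical, and the paper's exact formulation (multi-head handling, multi-token eviction, integration with H2O or SnapKV scores) may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values):
    """Single-query scaled dot-product attention; returns (weights, output)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

def closed_form_errors(q, keys, values):
    """Eviction error per token: a_j / (1 - a_j) * ||v_j - o||.

    Assumes at least two cached tokens, so that every weight a_j < 1.
    """
    weights, output = attention(q, keys, values)
    errors = []
    for a, v in zip(weights, values):
        dist = math.sqrt(sum((vi - oi) ** 2 for vi, oi in zip(v, output)))
        errors.append(a / (1.0 - a) * dist)
    return errors

def brute_force_error(q, keys, values, j):
    """Reference check: recompute attention without token j and compare outputs."""
    _, full = attention(q, keys, values)
    _, reduced = attention(q, keys[:j] + keys[j + 1:], values[:j] + values[j + 1:])
    return math.sqrt(sum((f - r) ** 2 for f, r in zip(full, reduced)))

# Toy data (illustrative only): token 2 has the lowest attention weight
# but a value vector far from the current output.
q = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

weights, _ = attention(q, keys, values)
errors = closed_form_errors(q, keys, values)
for j, (a, e) in enumerate(zip(weights, errors)):
    print(f"token {j}: attention={a:.3f} eviction_error={e:.3f}")
```

In this toy example, a purely score-based policy would evict token 2 (lowest attention weight), whereas the error-based ranking evicts token 1, because token 2's value vector lies far from the current output and removing it would shift the result more.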

Consistently improves results on LongBench, Needle-in-Haystack retrieval, and perplexity benchmarks for the LLaMA3 and Qwen2.5 model families when combined with state-of-the-art eviction baselines.
Supports block-wise prompt processing and dynamic KV cache budgeting, easing memory and compute bottlenecks for long-context inference on resource-constrained hardware, with no model retraining or fine-tuning required.
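Block-wise processing under a cache budget can be sketched as follows. This is an illustrative outline rather than the paper's procedure: the prompt is consumed in fixed-size blocks, and after each block the cache is trimmed back to the budget using a pluggable scoring function. The names `evict_to_budget`, `process_prompt_in_blocks`, and `score_fn` are hypothetical, and plain integers stand in for real key/value entries.

```python
from typing import Callable, List, Sequence

def evict_to_budget(cache: Sequence[int], scores: Sequence[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring entries, preserving their original order."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep]

def process_prompt_in_blocks(
    tokens: Sequence[int],
    block_size: int,
    budget: int,
    score_fn: Callable[[List[int]], List[float]],
) -> List[int]:
    """Append one block at a time, then trim the cache back to `budget`.

    `score_fn` is a hypothetical hook: it could return attention-based scores
    (as in score-based baselines) or an output-error-based score.
    """
    cache: List[int] = []
    for start in range(0, len(tokens), block_size):
        cache.extend(tokens[start:start + block_size])
        cache = evict_to_budget(cache, score_fn(cache), budget)
    return cache

# Toy scorer (assumption): always protect position 0, otherwise prefer recent tokens.
def toy_score(cache: List[int]) -> List[float]:
    return [100.0 if t == 0 else float(t) for t in cache]

kept = process_prompt_in_blocks(list(range(10)), block_size=4, budget=3, score_fn=toy_score)
print(kept)
```

With this structure the cache never holds more than `budget + block_size` entries at once, which is the property that makes long prompts tractable on memory-limited devices.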

Relies on a heuristic closed-form approximation rather than learned parameters; performance may vary across non-standard attention mechanisms or highly specialized architectures.
Primarily evaluated on decoder-only LLMs; generalization to multimodal, encoder-decoder, or hybrid models is not explicitly demonstrated.
The per-token cost of computing the CAOTE score during autoregressive generation must be weighed against the memory savings in ultra-low-latency deployments.