KV Cache Token Eviction
🔗 Source: arXiv
CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction
🚀 Technical Novelty
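- Mechanism: Derives a closed-form eviction score that estimates the error each candidate eviction would introduce into the attention output, combining attention scores (query-key alignment) with value vectors so that the token whose removal perturbs the output least is evicted first.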
- Nuance: Moves beyond pure attention-score heuristics by explicitly modeling how each cached token’s value vector contributes to the attention output, which lets it act as a meta-heuristic on top of existing attention-score-based eviction methods.
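
To make the mechanism concrete, here is a minimal single-query, single-head sketch in NumPy. It is not taken from the paper's code. It assumes standard softmax attention, where evicting token j renormalizes the remaining weights by 1 / (1 - alpha_j). Under that assumption the change in the attention output o has the closed form alpha_j / (1 - alpha_j) * (v_j - o), so its L2 norm can serve as an eviction score. The function names (`eviction_error_scores`, `brute_force_scores`) and the toy shapes are hypothetical, and the paper's exact formulation (multi-head aggregation, budgets, FastCAOTE) may differ.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def eviction_error_scores(q, K, V):
    """Closed-form L2 change in the attention output if each token is evicted.

    Assumes plain softmax attention for one query; alpha_j must be < 1.
    """
    d = q.shape[-1]
    alpha = softmax(K @ q / np.sqrt(d))   # attention weights, shape (n,)
    out = alpha @ V                       # attention output, shape (d_v,)
    # score_j = alpha_j / (1 - alpha_j) * ||v_j - out||_2
    scores = alpha / (1.0 - alpha) * np.linalg.norm(V - out, axis=1)
    return scores, alpha, out

def brute_force_scores(q, K, V):
    """Reference: drop each token, recompute attention, measure the change."""
    n, d = K.shape[0], q.shape[-1]
    full = softmax(K @ q / np.sqrt(d)) @ V
    errs = np.empty(n)
    for j in range(n):
        keep = np.arange(n) != j
        reduced = softmax(K[keep] @ q / np.sqrt(d)) @ V[keep]
        errs[j] = np.linalg.norm(full - reduced)
    return errs

# Toy cache: 8 tokens, head dimension 16 (illustrative values only).
rng = np.random.default_rng(0)
n, d = 8, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

scores, alpha, out = eviction_error_scores(q, K, V)
evict = int(np.argmin(scores))            # token whose removal changes the output least
lowest_attention = int(np.argmin(alpha))  # what a pure attention-score rule would pick
```

The two candidates `evict` and `lowest_attention` can disagree: a token with a small attention weight but a value vector far from the current output may cost more to remove than a token with a somewhat larger weight whose value is close to the output. That gap is the information a pure attention-score rule ignores.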
💡 Yield
- Consistently improves accuracy across LongBench, Needle-in-Haystack, and perplexity benchmarks when integrated with SOTA eviction strategies (H2O, TOVA, SnapKV) on LLaMA3 and Qwen2.5 families.
- Provides a theoretically grounded, post-training optimization that eases the attention compute and KV cache memory bottlenecks of long-context inference.
⚠️ Limitations
- Adds minor computational overhead during inference compared to pure attention-score methods (addressed via the proposed FastCAOTE approximation).
- Validated exclusively on decoder-only LLMs; generalization to non-Transformer or multimodal architectures remains unexplored.
- Operates strictly as a post-training eviction heuristic without modifying model weights or training objectives.