🔗 Source: arXiv

Fix the Structural Bottleneck: Context Compression via Explicit Information Transmission

🚀 Technical Novelty

  • Mechanism: Treats a frozen LLM as a feature extractor and decomposes compression into two stages: a widthwise optimal-transport-based global allocation plan to distribute information across compression slots, and a depthwise weighted shortcut mechanism to aggregate features across layers.
  • Nuance: Differs from prior SOTA by decoupling compression from the LLM’s trainable self-attention mechanism, explicitly solving the “lack of allocation” and “information dilution” bottlenecks inherent in standard LLM-as-a-compressor paradigms.
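
The two stages above can be illustrated with a small NumPy sketch. This is not the paper's implementation: the entropic Sinkhorn solver, the cosine-distance cost, the uniform marginals, the fuse-then-allocate ordering, and every name here (`sinkhorn_plan`, `depthwise_fuse`, `compress`, `slot_queries`, `layer_logits`) are assumptions chosen only to make the idea concrete. In a real setup the slot queries and layer logits would be the small set of trainable parameters, and `hidden_states` would come from the frozen LLM.

```python
import numpy as np


def sinkhorn_plan(cost, row_mass, col_mass, eps=0.5, n_iters=200):
    """Entropy-regularized optimal-transport plan via Sinkhorn iterations."""
    kernel = np.exp(-cost / eps)            # (n_tokens, n_slots)
    u = np.ones_like(row_mass)
    v = np.ones_like(col_mass)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)         # rescale rows toward row_mass
        v = col_mass / (kernel.T @ u)       # rescale columns to col_mass
    return u[:, None] * kernel * v[None, :]


def depthwise_fuse(hidden_states, layer_logits):
    """Softmax-weighted sum over layers: (n_layers, n_tokens, d) -> (n_tokens, d)."""
    w = np.exp(layer_logits - layer_logits.max())
    w /= w.sum()
    return np.tensordot(w, hidden_states, axes=1)


def compress(hidden_states, slot_queries, layer_logits, eps=0.5):
    """Return (slots, plan) for frozen features and hypothetical learned queries."""
    # Depthwise stage (assumed to run first): mix features across layers.
    fused = depthwise_fuse(hidden_states, layer_logits)

    # Widthwise stage: cost is 1 - cosine similarity (an assumed choice).
    f = fused / np.linalg.norm(fused, axis=1, keepdims=True)
    q = slot_queries / np.linalg.norm(slot_queries, axis=1, keepdims=True)
    cost = 1.0 - f @ q.T                    # (n_tokens, n_slots)

    n_tokens, n_slots = cost.shape
    row_mass = np.full(n_tokens, 1.0 / n_tokens)   # every token carries equal mass
    col_mass = np.full(n_slots, 1.0 / n_slots)     # every slot receives equal mass
    plan = sinkhorn_plan(cost, row_mass, col_mass, eps)

    # Each slot is a convex combination of token features, weighted by the plan.
    slots = (plan / col_mass).T @ fused     # (n_slots, d)
    return slots, plan


# Toy usage with random stand-ins for frozen-LLM hidden states.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 12, 16))       # 4 layers, 12 tokens, width 16
queries = rng.normal(size=(3, 16))          # 3 compression slots
slots, plan = compress(hidden, queries, np.zeros(4))
print(slots.shape, plan.shape)              # (3, 16) (12, 3)
```

Because the plan is constrained by global marginals, no slot can be starved and no token can be ignored, which is one way to read the "lack of allocation" fix; the learned layer weights let a slot draw on shallow and deep features directly instead of relying on the last layer alone.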

💡 Yield

  • Consistently outperforms strong soft-compression baselines across 12 datasets, improving average F1 by up to 18.5% while adding only ~1% trainable parameters and achieving >2× faster compression latency.
  • Maintains performance close to uncompressed baselines even at high compression ratios (×8, ×16) and scales effectively to larger models (Llama-3.1-8B) and longer contexts (8k).

⚠️ Limitations

  • Performance slightly lags on very short-context tasks (e.g., RelationExtraction with ~30 tokens) where fixed compression budgets constrain the effective ratio.
  • Relies on a frozen LLM as a feature extractor, meaning it cannot dynamically adapt its internal representations during the compression phase itself.