🔗 Source: arXiv

Interleaved Head Attention

🚀 Technical Novelty

  • Mechanism: Constructs P pseudo-queries, keys, and values per head as learned linear combinations of all original heads’ projections, enabling up to P² cross-head attention patterns within each head.
  • Nuance: Unlike prior head-mixing methods that operate on attention logits post-computation, IHA mixes inputs before the standard softmax operator, preserving compatibility with optimized kernels like FlashAttention while fundamentally altering information flow across heads.
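
The mechanism above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Aq`, `Ak`, `Av` (mixing weights) and `R` (weights that collapse the P pseudo-outputs back to one output per token) are hypothetical, the pseudo-tokens are assumed to be interleaved into one sequence of length P·N per head, and causal masking is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interleaved_head_attention(Q, K, V, Aq, Ak, Av, R):
    """Sketch of pre-softmax head mixing (assumed form, non-causal).

    Q, K, V    : (H, N, d) per-head projections of N tokens.
    Aq, Ak, Av : (H, P, H) hypothetical mixing weights; entry [h, p, g] is the
                 weight of original head g in pseudo-head p of head h.
    R          : (H, P) hypothetical weights collapsing P pseudo-outputs per token.
    """
    H, N, d = Q.shape
    P = Aq.shape[1]
    # Mix across heads BEFORE softmax: each pseudo-projection is a linear
    # combination of all original heads' projections.
    Qp = np.einsum("hpg,gnd->hnpd", Aq, Q)  # (H, N, P, d)
    Kp = np.einsum("hpg,gnd->hnpd", Ak, K)
    Vp = np.einsum("hpg,gnd->hnpd", Av, V)
    # Interleave the P pseudo-tokens of each token into one sequence of length N*P.
    Qi = Qp.reshape(H, N * P, d)
    Ki = Kp.reshape(H, N * P, d)
    Vi = Vp.reshape(H, N * P, d)
    # Standard scaled dot-product attention on the longer sequence; every
    # (pseudo-query, pseudo-key) pair gives one of up to P*P patterns per head.
    scores = Qi @ Ki.transpose(0, 2, 1) / np.sqrt(d)  # (H, N*P, N*P)
    out = softmax(scores) @ Vi                        # (H, N*P, d)
    # Collapse the P pseudo-outputs of each token back to a single output.
    out = out.reshape(H, N, P, d)
    return np.einsum("hp,hnpd->hnd", R, out)          # (H, N, d)
```

Because the mixing happens on the inputs, the softmax step is ordinary attention over a longer sequence, which is why a fused kernel could in principle be dropped in for the `scores`/`softmax` lines. With P = 1 and identity mixing weights, the sketch reduces to standard multi-head attention.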

💡 Yield

  • Theoretical proof of superior parameter efficiency over standard multi-head attention: Θ(√k·n²) vs Θ(k·n²) parameters for polynomial filters, and ⌈√N_max⌉ vs N_max heads for order-sensitive tasks.
  • Empirical gains: 10–20% improvement in multi-key retrieval on RULER (4k–16k context), +5.8% on GSM8K and +2.8% on MATH-500 after reasoning fine-tuning, all under FLOP-matched training.

⚠️ Limitations

  • Global IHA raises attention cost from O(N²) to O(P²N²), so sliding-window schedules or adaptive pseudo-head allocation are needed to keep it practical.
  • Current validation is limited to decoder-only LLMs; extensions to encoder-decoder and vision architectures are noted as future work.
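
The cost limitation can be made concrete with a small calculation. The sketch below counts attention-score entries per head, assuming the P pseudo-tokens per token form one sequence of length P·N; the function name and the `window` parameter are hypothetical, and the windowed count is a simple upper bound rather than the paper's schedule.

```python
def attention_score_entries(n_tokens, n_pseudo=1, window=None):
    """Count query-key score entries per head (illustrative only).

    n_tokens : sequence length N.
    n_pseudo : pseudo-heads per head P (1 corresponds to standard attention).
    window   : optional cap on keys visible to each query (assumed sliding window).
    """
    length = n_tokens * n_pseudo          # effective sequence length P*N
    if window is None:
        return length * length            # global attention: (P*N)^2
    return length * min(window, length)   # windowed: at most P*N*window

base = attention_score_entries(4096)                 # N^2
global_iha = attention_score_entries(4096, 2)        # (2N)^2, i.e. 4x base
windowed = attention_score_entries(4096, 2, 1024)    # 2N * 1024
print(global_iha // base)  # 4
```

Under these assumptions, doubling P quadruples the global cost, while a fixed window makes the cost grow only linearly in P·N.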