Predictive FFN Sparsity
🔗 Source: arXiv
FASTFORWARD: ACCELERATING LLM PREFILL WITH PREDICTIVE FFN SPARSITY
🚀 Technical Novelty
- Mechanism: Block-wise context-aware FFN sparsity driven by a lightweight expert predictor, an auxiliary error compensation network, and a layer-wise compute scheduler.
- Nuance: Unlike decoding-focused or static masking methods, which break prefill parallelism or require prompt statistics that are not available, the method predicts neuron importance per block before the FFN is computed and allocates fidelity across layers according to token-mixing importance.
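A minimal sketch of the block-wise selection idea, assuming a SwiGLU-style FFN and a hypothetical linear predictor that scores every hidden neuron from the mean token of each block. The names `gated_ffn`, `blockwise_sparse_ffn`, `predictor`, `block_size`, and `keep_ratio` are illustrative and not taken from the paper; the error compensation network, the layer-wise scheduler, and any grouping of neurons into experts are omitted.

```python
import numpy as np

def gated_ffn(x, w_gate, w_up, w_down, idx=None):
    """SwiGLU-style FFN. If idx is given, only those hidden neurons are used."""
    if idx is not None:
        w_gate, w_up, w_down = w_gate[:, idx], w_up[:, idx], w_down[idx, :]
    g = x @ w_gate
    h = (g / (1.0 + np.exp(-g))) * (x @ w_up)  # SiLU(gate) * up
    return h @ w_down

def blockwise_sparse_ffn(x, w_gate, w_up, w_down, predictor, block_size, keep_ratio):
    """Run the FFN block by block, keeping only the top-scoring neurons per block."""
    n_tokens = x.shape[0]
    d_ff = w_gate.shape[1]
    k = max(1, int(round(keep_ratio * d_ff)))  # neurons kept per block
    out = np.empty_like(x)
    for start in range(0, n_tokens, block_size):
        blk = x[start:start + block_size]
        # Assumption: the predictor is a single (d_model, d_ff) matrix applied
        # to the block's mean token, giving one importance score per neuron.
        scores = blk.mean(axis=0) @ predictor
        idx = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores
        # All tokens in the block share one neuron subset, so the block is
        # still processed with dense matrix multiplies.
        out[start:start + block_size] = gated_ffn(blk, w_gate, w_up, w_down, idx)
    return out
```

Because every token in a block shares the same neuron subset, the computation stays a batch of smaller dense matmuls rather than a per-token gather, which is what keeps prefill parallel in this sketch.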
💡 Yield
- Achieves up to 1.45× compute-bound speedup at 50% sparsity with <6% accuracy drop across LLaMA/Qwen models (1B–8B) on LongBench.
- Shows that FFNs dominate prefill FLOPs for contexts up to ~28K tokens, making targeted FFN sparsity the most impactful lever for reducing time-to-first-token (TTFT).
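The FLOP argument can be sketched with rough per-layer accounting. This is an illustration under stated assumptions, not the paper's own analysis: a gated FFN with three projections, grouped-query attention, causal masking counted as half of the full score matrix, and 2 FLOPs per multiply-accumulate. The function names are hypothetical, and the crossover length depends on these conventions and on the model, so the sketch is not expected to reproduce the ~28K figure exactly. `ideal_speedup` is an Amdahl-style upper bound that ignores predictor and compensator overhead.

```python
def prefill_flops_per_layer(n_tokens, d_model, d_ff, n_heads, n_kv_heads):
    """Return (ffn_flops, attention_flops) for one transformer layer during prefill."""
    head_dim = d_model // n_heads
    d_kv = n_kv_heads * head_dim
    # Gated FFN: gate, up, and down projections, each d_model * d_ff MACs per token.
    ffn = n_tokens * 2 * 3 * d_model * d_ff
    # Attention projections: Q and O are d_model x d_model, K and V are d_model x d_kv.
    attn_proj = n_tokens * 2 * (2 * d_model * d_model + 2 * d_model * d_kv)
    # Token mixing: QK^T and AV cost 4 * n^2 * d_model FLOPs unmasked; causal ~ half.
    attn_mix = 2 * n_tokens * n_tokens * d_model
    return ffn, attn_proj + attn_mix

def ideal_speedup(ffn_fraction, sparsity):
    """Amdahl-style bound: only the FFN share of compute shrinks with sparsity."""
    return 1.0 / (1.0 - ffn_fraction * sparsity)

if __name__ == "__main__":
    # Assumed 8B-class dimensions, used only for illustration.
    for n in (1024, 8192, 32768, 131072):
        ffn, attn = prefill_flops_per_layer(n, 4096, 14336, 32, 8)
        share = ffn / (ffn + attn)
        print(n, round(share, 3), round(ideal_speedup(share, 0.5), 3))
```

The FFN term grows linearly with context length while the token-mixing term grows quadratically, so the FFN share, and with it the benefit of FFN sparsity, shrinks as prompts get longer.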
⚠️ Limitations
- The error compensator receives no direct signal about which experts were selected or pruned, which limits how much it can correct for suboptimal selections.
- Current dynamic expert loading remains memory-bandwidth intensive; full efficiency requires future static expert integration.