Predictive FFN Sparsity

🔗 Source: arXiv

FASTFORWARD: ACCELERATING LLM PREFILL WITH PREDICTIVE FFN SPARSITY

🚀 Technical Novelty

Mechanism: Block-wise context-aware FFN sparsity driven by a lightweight expert predictor, an auxiliary error compensation network, and a layer-wise compute scheduler.
Nuance: Unlike decoding-focused or static-masking methods, which break prefill parallelism or require prompt statistics that are unavailable in advance, FastForward predicts neuron importance per block of tokens before the FFN runs, and it allocates compute across layers according to each layer's token-mixing importance.
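
The block-wise selection described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the predictor here is a hypothetical single linear map (`w_pred`) applied to the mean token of each block, the FFN is assumed to be a SwiGLU-style gated FFN, and the error compensation network and layer-wise scheduler are left out.

```python
import numpy as np

def gated_ffn(x, w_gate, w_up, w_down, idx=None):
    """SwiGLU-style FFN. If idx is given, only those hidden neurons are used."""
    if idx is not None:
        w_gate, w_up, w_down = w_gate[:, idx], w_up[:, idx], w_down[idx, :]
    g = x @ w_gate
    h = (g / (1.0 + np.exp(-g))) * (x @ w_up)  # SiLU(g) * up-projection
    return h @ w_down

def predict_neurons(block, w_pred, keep):
    """Hypothetical predictor: score neurons from the block's mean token, keep top-k."""
    scores = block.mean(axis=0) @ w_pred       # shape (d_ff,)
    return np.sort(np.argpartition(scores, -keep)[-keep:])

def sparse_prefill_ffn(x, w_gate, w_up, w_down, w_pred, block_size=4, sparsity=0.5):
    """Run the FFN block by block, each block using its own predicted neuron subset."""
    d_ff = w_gate.shape[1]
    keep = max(1, int(round(d_ff * (1.0 - sparsity))))
    out = np.empty_like(x)
    for start in range(0, x.shape[0], block_size):
        block = x[start:start + block_size]
        idx = predict_neurons(block, w_pred, keep)  # one selection shared by the block
        out[start:start + block_size] = gated_ffn(block, w_gate, w_up, w_down, idx)
    return out

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 10, 8, 16
x = rng.standard_normal((n_tokens, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
w_pred = rng.standard_normal((d_model, d_ff))

dense = gated_ffn(x, w_gate, w_up, w_down)
sparse = sparse_prefill_ffn(x, w_gate, w_up, w_down, w_pred, sparsity=0.5)
```

Because all tokens in a block share one neuron subset, each block is still a dense matrix multiply over sliced weights, which is what keeps prefill parallel. With random weights the gap between `dense` and `sparse` is large; in the method summarized here a trained predictor and the compensation network are meant to shrink it.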

💡 Yield

Achieves up to 1.45× compute-bound speedup at 50% sparsity with <6% accuracy drop across LLaMA/Qwen models (1B–8B) on LongBench.
Shows that FFNs dominate prefill FLOPs for contexts up to ~28K tokens, making FFN sparsity the most impactful lever for reducing time to first token (TTFT) in that range.
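
A back-of-the-envelope FLOP count helps show why FFNs dominate at short and medium context lengths. The sketch below assumes a gated FFN with three projections, grouped-query attention, and causal masking, and counts only matrix multiplies. The function names and the example dimensions are illustrative assumptions, and the crossover point depends on these counting conventions, so it need not match the ~28K figure exactly.

```python
def prefill_flops_per_layer(n, d_model, d_ff, d_kv=None, causal=True):
    """Rough matmul FLOPs for one transformer layer over an n-token prompt."""
    d_kv = d_model if d_kv is None else d_kv
    ffn = 6 * n * d_model * d_ff                            # gate, up, down projections
    attn_proj = 2 * n * d_model * (2 * d_model + 2 * d_kv)  # Q and O, plus K and V
    attn_mix = 4 * n * n * d_model                          # QK^T plus weighted sum over V
    if causal:
        attn_mix //= 2                                      # only the lower triangle is needed
    return {"ffn": ffn, "attn_proj": attn_proj, "attn_mix": attn_mix}

def crossover_length(d_model, d_ff, d_kv=None, causal=True):
    """Context length at which attention FLOPs equal FFN FLOPs under this model."""
    d_kv = d_model if d_kv is None else d_kv
    per_pair = 2 if causal else 4
    return (6 * d_ff - 4 * d_model - 4 * d_kv) / per_pair

def ideal_speedup(ffn_fraction, sparsity):
    """Amdahl-style bound: only the FFN share of compute shrinks, with no predictor overhead."""
    return 1.0 / (1.0 - ffn_fraction * sparsity)

# Illustrative dimensions resembling an 8B-class model (assumed, not from the source).
flops = prefill_flops_per_layer(n=4096, d_model=4096, d_ff=14336, d_kv=1024)
ffn_share = flops["ffn"] / sum(flops.values())
n_cross = crossover_length(d_model=4096, d_ff=14336, d_kv=1024)
bound = ideal_speedup(ffn_share, sparsity=0.5)
```

Under this idealized model, a 1.45× speedup at 50% sparsity would require FFNs to account for roughly 62% of total compute, since 1 / (1 - 0.5 × 0.62) ≈ 1.45. Predictor and compensator overhead would push the required share higher.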

⚠️ Limitations

The error compensator lacks direct signals about selected/pruned experts, limiting correction magnitude for suboptimal selections.
Dynamic expert loading remains memory-bandwidth intensive; reaching full efficiency is left to future work on static expert integration.