FastForward Predictive FFN Sparsity
🔗 Source: arXiv
FastForward: Accelerating LLM Prefill with Predictive FFN Sparsity
🚀 Technical Novelty
- Mechanism: Block-wise, context-aware FFN sparsity. A lightweight expert predictor proactively selects the high-importance neurons for each block, an error compensation network corrects the resulting approximation error, and a layer-wise scheduler sets the sparsity level per layer.
- Nuance: Unlike prior decoding-focused or static masking methods, it exploits prefill parallelism through proactive, block-level prediction and dynamic compute allocation, avoiding the accuracy collapse of uniform sparsity on long prompts.
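
The block-wise mechanism can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not the paper's implementation: the predictor is a hypothetical mean-pool followed by a linear scorer (`w_pred`), the FFN is a plain ReLU two-matrix layer rather than the gated FFN used by LLaMA/Qwen, and the error compensation network and layer-wise scheduler are left out.

```python
import numpy as np

def dense_ffn(x, w_up, w_down):
    """Reference dense FFN: ReLU(x @ w_up) @ w_down (ungated, for illustration)."""
    return np.maximum(x @ w_up, 0.0) @ w_down

def predict_neurons(block, w_pred, keep):
    """Hypothetical predictor: pool the block's tokens, score every FFN neuron
    with one linear map, and return the indices of the `keep` highest scores."""
    pooled = block.mean(axis=0)                 # (d_model,)
    scores = pooled @ w_pred                    # (d_ff,)
    top = np.argpartition(scores, -keep)[-keep:]
    return np.sort(top)

def sparse_ffn_prefill(x, w_up, w_down, w_pred, block_size, sparsity):
    """Run the FFN over prompt tokens block by block, computing only the
    neurons selected for each block before its FFN matmuls execute."""
    d_ff = w_up.shape[1]
    keep = max(1, int(round(d_ff * (1.0 - sparsity))))
    out = np.empty((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    for start in range(0, x.shape[0], block_size):
        block = x[start:start + block_size]
        idx = predict_neurons(block, w_pred, keep)
        # Only the selected columns of w_up and rows of w_down are touched.
        hidden = np.maximum(block @ w_up[:, idx], 0.0)
        out[start:start + block_size] = hidden @ w_down[idx, :]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))        # 10 prompt tokens, d_model = 8
w_up = rng.standard_normal((8, 32))     # d_ff = 32
w_down = rng.standard_normal((32, 8))
w_pred = rng.standard_normal((8, 32))   # untrained stand-in predictor weights
y = sparse_ffn_prefill(x, w_up, w_down, w_pred, block_size=4, sparsity=0.5)
```

Because all tokens in a block share one neuron set, each block still runs as two dense matmuls over smaller matrices, which is the property that lets this style of sparsity exploit prefill parallelism. With `sparsity=0.0` the sketch reduces to the dense FFN.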
💡 Yield
- Delivers up to 1.45× compute-bound speedup at 50% FFN sparsity across LLaMA/Qwen models (1B–8B) with <6% accuracy loss on LongBench, significantly reducing Time-to-First-Token for short-to-moderate context workloads.
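
As a rough sanity check on the speedup figure, an idealized Amdahl-style model relates FFN sparsity to end-to-end compute-bound speedup. This model is an assumption of this note, not something taken from the paper: it treats prefill cost as an FFN share plus everything else, assumes skipped neurons cost nothing, and folds predictor and compensator cost into a hypothetical `overhead` term.

```python
def prefill_speedup(ffn_fraction, sparsity, overhead=0.0):
    """Idealized speedup when `sparsity` of the FFN work is skipped.
    `ffn_fraction` is the FFN share of dense prefill compute; `overhead`
    is extra work (as a fraction of dense cost) from any added components."""
    remaining = (1.0 - ffn_fraction) + ffn_fraction * (1.0 - sparsity) + overhead
    return 1.0 / remaining

def implied_ffn_fraction(speedup, sparsity):
    """Invert the zero-overhead model: the FFN share that would be needed
    to reach `speedup` at the given sparsity."""
    return (1.0 - 1.0 / speedup) / sparsity

# Under this model, 1.45x at 50% sparsity corresponds to an FFN share of about 0.62.
share = implied_ffn_fraction(1.45, 0.5)
```

The implied share is only what the simplified model requires, not a measured quantity. Any nonzero overhead would push the required FFN share higher.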
⚠️ Limitations
- The error compensator receives no direct signal about which neurons were pruned, which limits how well it can correct large deviations caused by suboptimal expert selection.
- Dynamic expert loading remains more memory-bandwidth intensive than static alternatives.