Sparse Efficient LLM Kernels
🔗 Source: arXiv
Sparser, Faster, Lighter Transformer Language Models
🚀 Technical Novelty
- Mechanism: Introduces TwELL (Tile-wise ELLPACK) sparse packing format and fused CUDA kernels that eliminate index-management overhead, enabling efficient execution of entire gated feed-forward blocks in just two kernel launches per layer.
- Nuance: Unlike prior sparse methods, which suffered from mismatches with GPU hardware and software or required heavy architectural deviations, this work targets compute-bound GEMM operations in batched settings, bridging the gap between unstructured sparsity and modern Tensor Core pipelines.
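To make the packing idea concrete, here is a minimal pure-Python sketch of a tile-wise ELLPACK layout. It is an illustration inferred from the format's name, not the paper's actual TwELL layout or kernels: the function names, the `PAD` marker, and the choice to tile along columns with tile-local indices are all assumptions.

```python
PAD = -1  # hypothetical marker for unused index slots


def pack_tilewise_ell(dense, tile_width):
    """Pack a non-empty rectangular matrix tile by tile in ELLPACK style.

    Columns are split into tiles of `tile_width`. Within each tile, every
    row stores its nonzero values plus tile-local column offsets, padded
    to the largest nonzero count of any row in that tile.
    Returns a list of (tile_start, values, indices) tuples.
    """
    n_cols = len(dense[0])
    tiles = []
    for start in range(0, n_cols, tile_width):
        stop = min(start + tile_width, n_cols)
        # Collect (local_offset, value) pairs for each row in this tile.
        rows = [
            [(c - start, row[c]) for c in range(start, stop) if row[c] != 0.0]
            for row in dense
        ]
        width = max((len(r) for r in rows), default=0)
        values = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
        indices = [[i for i, _ in r] + [PAD] * (width - len(r)) for r in rows]
        tiles.append((start, values, indices))
    return tiles


def unpack_tilewise_ell(tiles, n_rows, n_cols):
    """Rebuild the dense matrix from the packed tiles (round-trip check)."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for start, values, indices in tiles:
        for r in range(n_rows):
            for v, i in zip(values[r], indices[r]):
                if i != PAD:
                    dense[r][start + i] = v
    return dense
```

In this sketch the padded width is set per tile by that tile's densest row, so one dense row inflates the storage of a single tile only, not of the whole matrix. Whether TwELL handles padding the same way is not stated in this summary.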
💡 Yield
- Mild L1 regularization induces >99% activation sparsity across feed-forward layers with negligible downstream performance degradation.
- Delivers up to 20.5% inference speedup and 21.9% training speedup at billion-parameter scales, alongside substantial reductions in energy consumption and intermediate activation memory footprint.
- Demonstrates that the uneven distribution of computation across network layers and natural-language inputs can be systematically exploited for scalable efficiency gains without altering the core model architecture.
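The sparsity claim above can be illustrated with a small pure-Python sketch of an L1 penalty on the hidden activations of a ReLU-gated feed-forward block, together with the sparsity metric it is meant to raise. Which tensor the paper actually penalizes, and with what coefficient, is not stated in this summary; the function names, the `coeff` parameter, and the mean-absolute-value form of the penalty are assumptions.

```python
def relu(x):
    """Standard ReLU: negative inputs become exactly zero."""
    return x if x > 0.0 else 0.0


def gated_hidden(gate_pre, up):
    """Hidden activations of a ReLU-gated unit: relu(gate) * up, elementwise."""
    return [relu(g) * u for g, u in zip(gate_pre, up)]


def l1_penalty(hidden, coeff):
    """Mean absolute activation scaled by `coeff` (assumed penalty form).

    In training this term would be added to the task loss.
    """
    return coeff * sum(abs(h) for h in hidden) / len(hidden)


def activation_sparsity(hidden):
    """Fraction of hidden activations that are exactly zero."""
    return sum(1 for h in hidden if h == 0.0) / len(hidden)


# Toy example: three of the four gate pre-activations are negative,
# so ReLU zeroes the corresponding hidden units.
gate_pre = [-1.0, 0.5, -2.0, -0.3]
up = [2.0, -4.0, 1.0, 3.0]
hidden = gated_hidden(gate_pre, up)
```

Because ReLU produces exact zeros, sparsity here is a count of zero entries with no threshold involved, which is what lets a sparse kernel skip those entries outright.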
⚠️ Limitations
- Requires mitigation strategies (e.g., dead-neuron handling) to maintain performance at extreme sparsity levels, as noted in preliminary results.
- Relies on ReLU activations to induce sparsity, whereas some contemporary architectures adopt smoother activation functions (e.g., SiLU); bridging techniques are acknowledged.
- Primarily optimized for NVIDIA H100 GPUs and batched inference/training workloads; broader cross-platform or attention-sparsity integration is deferred to future work.
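As a rough illustration of the dead-neuron concern listed above, the sketch below flags hidden units that stay at zero for every token in a batch. It shows detection only and is not the paper's mitigation strategy, which this summary does not describe; the function name and the batch-of-rows input layout are assumptions.

```python
def find_dead_neurons(batch_hidden):
    """Return indices of hidden units that are zero for every token.

    `batch_hidden` is a non-empty list of per-token activation rows,
    all of the same length.
    """
    n_units = len(batch_hidden[0])
    return [
        j
        for j in range(n_units)
        if all(row[j] == 0.0 for row in batch_hidden)
    ]


# Toy batch of three tokens and three hidden units: unit 0 never fires.
batch = [
    [0.0, 1.2, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.3, 0.7],
]
dead = find_dead_neurons(batch)
```

A unit flagged on one batch may still fire on other data, so in practice such a check would likely be accumulated over many batches before any unit is treated as dead.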