🔗 Source: arXiv

Sparser, Faster, Lighter Transformer Language Models

Mechanism: Introduces the TwELL (Tile-wise ELLPACK) sparse packing format and fused CUDA kernels that seamlessly integrate unstructured sparsity into modern GPU execution pipelines for LLM feed-forward blocks.
Nuance: Unlike prior methods that rely on structured pruning or suffer from index-management overhead, this approach applies mild L1 regularization during training to induce >99% sparsity, and fuses operations to remove memory bottlenecks in otherwise compute-bound GEMM workloads.
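
To make the packing idea concrete, here is a minimal sketch of generic ELLPACK applied independently to row tiles. It is not the paper's TwELL layout or its CUDA kernels: the function names (`ell_pack`, `pack_tiles`, `ell_matvec`), the `PAD` sentinel, and the choice of row tiles with a per-tile width are all assumptions made for illustration.

```python
from typing import List, Tuple

PAD = -1  # hypothetical sentinel: column index marking an unused (padded) slot

Packed = Tuple[List[List[float]], List[List[int]]]


def ell_pack(tile: List[List[float]]) -> Packed:
    """Pack one dense tile in ELLPACK form.

    Every row gets `width` slots, where `width` is the largest number of
    nonzeros found in any row of this tile. Shorter rows are padded.
    """
    width = max(sum(1 for v in row if v != 0.0) for row in tile)
    values, cols = [], []
    for row in tile:
        nonzeros = [(v, j) for j, v in enumerate(row) if v != 0.0]
        pad = width - len(nonzeros)
        values.append([v for v, _ in nonzeros] + [0.0] * pad)
        cols.append([j for _, j in nonzeros] + [PAD] * pad)
    return values, cols


def pack_tiles(matrix: List[List[float]], tile_rows: int) -> List[Packed]:
    """Split the matrix into row tiles and pack each with its own width."""
    return [
        ell_pack(matrix[r:r + tile_rows])
        for r in range(0, len(matrix), tile_rows)
    ]


def ell_matvec(packed_tiles: List[Packed], x: List[float]) -> List[float]:
    """Multiply the packed matrix by a dense vector, skipping padded slots."""
    y = []
    for values, cols in packed_tiles:
        for row_vals, row_cols in zip(values, cols):
            y.append(sum(v * x[j] for v, j in zip(row_vals, row_cols) if j != PAD))
    return y
```

Because each tile picks its own width, one dense row only inflates the padding of its own tile. For a 4x4 matrix whose rows hold 1, 2, 0, and 1 nonzeros, `pack_tiles(matrix, 2)` stores 6 value slots, versus 8 with a single matrix-wide width.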

Mild L1 regularization achieves over 99% unstructured sparsity with negligible downstream performance degradation.
Delivers up to 20.5% inference and 21.9% training speedups on billion-parameter models, with throughput, energy-efficiency, and memory-footprint benefits that grow with model scale.
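
The L1 regularization and the sparsity figure quoted above can be sketched as follows. This is an illustrative sketch, not the paper's training code. The summary does not say which tensor is penalized, so the sketch assumes the penalty is applied to feed-forward hidden activations. The names `l1_penalty`, `sparsity`, `regularized_loss` and the coefficient `l1_coeff` are hypothetical, as is its default value.

```python
from typing import List

Activations = List[List[float]]  # assumed shape: [tokens][hidden_units]


def l1_penalty(hidden: Activations) -> float:
    """Mean absolute value over all entries (the L1 term, normalized)."""
    n = sum(len(row) for row in hidden)
    return sum(abs(v) for row in hidden for v in row) / n


def sparsity(hidden: Activations, eps: float = 0.0) -> float:
    """Fraction of entries whose magnitude is at most `eps`."""
    n = sum(len(row) for row in hidden)
    zeros = sum(1 for row in hidden for v in row if abs(v) <= eps)
    return zeros / n


def regularized_loss(task_loss: float, hidden: Activations,
                     l1_coeff: float = 1e-4) -> float:
    """Task loss plus a small L1 term; `l1_coeff` is an assumed value."""
    return task_loss + l1_coeff * l1_penalty(hidden)
```

A "mild" penalty corresponds to a small `l1_coeff`: the task loss dominates, while the L1 term steadily pushes small values toward exactly zero, which is what `sparsity` then measures.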

Optimizations are tightly coupled to modern NVIDIA GPU architectures (e.g., Tensor Cores) and batched computation settings.
Highly sparse models may still require targeted strategies to mitigate dead neurons, and direct application to existing dense pretrained models requires additional sparsification fine-tuning.
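
As a rough illustration of the dead-neuron concern, the sketch below flags hidden units that never activate on a batch of tokens. It assumes a dead neuron means a unit whose activation is zero for every token observed. The function name `dead_neurons` and the `[tokens][hidden_units]` layout are assumptions, and the sketch only detects such units; it does not implement any mitigation strategy from the paper.

```python
from typing import List


def dead_neurons(hidden: List[List[float]], eps: float = 0.0) -> List[int]:
    """Return indices of hidden units that stay within `eps` of zero
    for every token in `hidden` (assumed layout: [tokens][hidden_units])."""
    num_units = len(hidden[0])
    return [
        j for j in range(num_units)
        if all(abs(row[j]) <= eps for row in hidden)
    ]
```

In practice a check like this would be run over many batches, since a unit that is silent on one small batch may still fire on other inputs.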