🔗 Source: arXiv

Sparser, Faster, Lighter Transformer Language Models

Mechanism: Introduces TwELL (Tile-wise ELLPACK) sparse packing format and fused CUDA kernels that eliminate index-management overhead, enabling efficient execution of entire gated feed-forward blocks in just two kernel launches per layer.
Nuance: Unlike prior sparse methods that suffered from GPU hardware/software mismatches or required heavy architectural deviations, this work targets compute-bound GEMM operations in batched settings, directly bridging the gap between unstructured sparsity and modern Tensor Core pipelines.
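
For readers unfamiliar with ELLPACK, the sketch below shows the general idea behind the format that TwELL builds on: every row keeps a fixed number of (value, column index) slots, padded where a row has fewer nonzeros. The tile-wise step here (packing each column tile separately with tile-local indices) is an assumption about what "tile-wise" means, not the paper's actual memory layout or its CUDA kernels. The names `pack_ell`, `pack_tiled_ell`, `unpack_tiled_ell`, and `PAD` are hypothetical.

```python
from typing import List, Tuple

PAD = -1  # hypothetical marker for an unused index slot

Matrix = List[List[float]]


def pack_ell(rows: Matrix) -> Tuple[Matrix, List[List[int]]]:
    """Plain ELLPACK: each row stores `width` value slots and `width` column slots."""
    # Width is the largest nonzero count of any row; shorter rows are padded.
    width = max((sum(1 for v in row if v != 0.0) for row in rows), default=0)
    values, cols = [], []
    for row in rows:
        nz = [(v, j) for j, v in enumerate(row) if v != 0.0]
        pad = width - len(nz)
        values.append([v for v, _ in nz] + [0.0] * pad)
        cols.append([j for _, j in nz] + [PAD] * pad)
    return values, cols


def pack_tiled_ell(rows: Matrix, tile_cols: int):
    """Assumed tile-wise variant: pack each column tile on its own.

    Returns a list of (tile_start, values, cols) with tile-local column indices.
    """
    n_cols = len(rows[0]) if rows else 0
    tiles = []
    for start in range(0, n_cols, tile_cols):
        sub = [row[start:start + tile_cols] for row in rows]
        tiles.append((start, *pack_ell(sub)))
    return tiles


def unpack_tiled_ell(tiles, n_rows: int, n_cols: int) -> Matrix:
    """Rebuild the dense matrix, skipping padded slots."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for start, values, cols in tiles:
        for i in range(n_rows):
            for v, j in zip(values[i], cols[i]):
                if j != PAD:
                    dense[i][start + j] = v
    return dense
```

Packing per tile keeps the padded width tied to the densest row within each tile rather than the densest row of the whole matrix, which is one plausible reason a tiled layout suits highly sparse activations.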

Mild L1 regularization induces >99% activation sparsity across feed-forward layers with negligible downstream performance degradation.
Delivers up to 20.5% inference speedup and 21.9% training speedup at billion-parameter scales, alongside substantial reductions in energy consumption and intermediate activation memory footprint.
Demonstrates that computational unevenness across network layers and natural language data can be systematically exploited for scalable efficiency gains without altering core model architectures.
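
To make the L1 result above concrete, here is a minimal sketch of an L1 activation penalty and a sparsity measurement for a ReLU-gated feed-forward block. It assumes the penalty is applied to the hidden activations `relu(W_gate x) * (W_up x)`; the summary does not say exactly where the paper applies it. The function names and the coefficient `coeff` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def matvec(w: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product, one output per row of w."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def gated_hidden(x: Vector, w_gate: Matrix, w_up: Matrix) -> Vector:
    """Hidden activations of a ReLU-gated block: relu(W_gate x) * (W_up x)."""
    gate = [relu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return [g * u for g, u in zip(gate, up)]


def l1_penalty(hidden: Vector, coeff: float) -> float:
    """L1 regularization term that would be added to the training loss (assumed placement)."""
    return coeff * sum(abs(h) for h in hidden)


def sparsity(hidden: Vector) -> float:
    """Fraction of hidden units that are exactly zero."""
    return sum(1 for h in hidden if h == 0.0) / len(hidden)
```

Because ReLU outputs exact zeros, any unit whose gate pre-activation is non-positive contributes nothing to the down projection, which is what a sparse kernel can skip.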

Maintaining performance at extreme sparsity levels requires mitigation strategies (e.g., dead-neuron handling), according to the paper's preliminary results.
Relies on ReLU activations to induce sparsity, whereas some contemporary architectures use smoother activation functions (e.g., SiLU); bridging techniques are acknowledged.
Primarily optimized for NVIDIA H100 GPUs and batched inference/training workloads; broader cross-platform or attention-sparsity integration is deferred to future work.
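
The dead-neuron concern in the limitations above can be illustrated with a simple detection pass. This is a sketch under the assumption that a "dead" neuron is a hidden unit that stays at exactly zero for every input in a batch; the paper's actual mitigation strategy is not described in this summary, and `dead_neurons` is a hypothetical helper.

```python
from typing import List


def dead_neurons(batch_hidden: List[List[float]]) -> List[int]:
    """Return indices of hidden units that are zero for every example in the batch."""
    if not batch_hidden:
        return []
    n_units = len(batch_hidden[0])
    return [
        j for j in range(n_units)
        if all(example[j] == 0.0 for example in batch_hidden)
    ]
```

A monitoring loop could track this list over training steps to see whether stronger regularization permanently silences units, which is the kind of signal a mitigation scheme would need.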