🔗 Source: arXiv

Deep Hierarchical Learning with Nested Subspace Networks for Large Language Models

Mechanism: Re-parameterizes linear layers using shared low-rank factor matrices (A, B) to construct a nested hierarchy of effective weights at varying ranks, optimized jointly via an uncertainty-aware objective that balances contributions across the hierarchy.
Nuance: Unlike static compression or discrete dynamic networks, Nested Subspace Networks (NSNs) expose a continuous, smooth compute-performance frontier. They are applied post-hoc to frozen pre-trained models, without altering tensor shapes or requiring specialized from-scratch training schemes.
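
The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the class name `NestedLowRankLinear`, the random initialization, and the helper `uncertainty_weighted_loss` are all hypothetical. In particular, the loss uses the common homoscedastic-uncertainty weighting (each per-rank loss scaled by a learned `exp(-s)` plus a regularizer `s`); whether NSNs use exactly this form is an assumption.

```python
import numpy as np


class NestedLowRankLinear:
    """Hypothetical layer: one shared factor pair (A, B) serves every rank."""

    def __init__(self, d_in, d_out, max_rank, seed=0):
        rng = np.random.default_rng(seed)
        # Random factors are for illustration only; how a post-hoc method
        # initializes them from a pre-trained weight is not shown here.
        self.A = rng.standard_normal((max_rank, d_in)) / np.sqrt(d_in)
        self.B = rng.standard_normal((d_out, max_rank)) / np.sqrt(max_rank)
        self.max_rank = max_rank

    def effective_weight(self, rank):
        # W_r = B[:, :r] @ A[:r, :]. A smaller rank reuses a prefix of the
        # same factors, which is what makes the hierarchy nested.
        return self.B[:, :rank] @ self.A[:rank, :]

    def forward(self, x, rank):
        # x has shape (batch, d_in). Applying the two factors in sequence
        # avoids ever materializing the (d_out, d_in) effective weight.
        if not 1 <= rank <= self.max_rank:
            raise ValueError("rank must be in [1, max_rank]")
        return (x @ self.A[:rank, :].T) @ self.B[:, :rank].T


def uncertainty_weighted_loss(per_rank_losses, log_vars):
    # ASSUMPTION: homoscedastic-uncertainty weighting, one learned
    # log-variance s_r per rank: sum_r exp(-s_r) * L_r + s_r.
    losses = np.asarray(per_rank_losses, dtype=float)
    s = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-s) * losses + s))
```

Because the output shape is `(batch, d_out)` at every rank, the rank passed to `forward` can change per call without touching the surrounding model, which is the property the summary describes as test-time adaptability.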

Achieves up to 50% FLOPs reduction with only ~5% accuracy drop across multiple LLMs (Pythia, GPT-Neo, Gemma, Qwen).
Provides theoretical guarantees for granular budget control and smooth Pareto frontiers at inference.
Simultaneously satisfies instant test-time adaptability, post-hoc applicability to any foundation model, and architectural agnosticism.

Currently applies uniform rank scaling across all layers rather than layer-specific adaptive compute.
Fine-grained control is left to future work, which would need to relate problem-specific information to layer-specific representational capacity.
FLOPs reduction is bounded by a dimension-dependent break-even point, limiting gains for certain layer sizes or specific input/output configurations.
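
The break-even point can be made concrete with a short calculation. The sketch below assumes the standard cost model for a two-factor low-rank layer, where a dense layer costs `d_in * d_out` multiply-accumulates per token and the factored form costs `rank * (d_in + d_out)`. The function names are illustrative, and the paper's exact FLOPs accounting may differ.

```python
def dense_macs(d_in, d_out):
    # Multiply-accumulates per token for a dense d_out x d_in weight.
    return d_in * d_out


def low_rank_macs(d_in, d_out, rank):
    # Two thin matmuls: (rank x d_in) followed by (d_out x rank).
    return rank * (d_in + d_out)


def break_even_rank(d_in, d_out):
    # Rank at which the factored form costs the same as the dense layer;
    # above it, the factorization is more expensive under this cost model.
    return d_in * d_out / (d_in + d_out)


def flops_reduction(d_in, d_out, rank):
    # Fraction of dense compute saved; negative means the factored form costs more.
    return 1.0 - low_rank_macs(d_in, d_out, rank) / dense_macs(d_in, d_out)
```

Under this model a square 4096 x 4096 layer breaks even at rank 2048, and rank 1024 gives a 50% reduction. Because the break-even rank depends on both dimensions, the rank that yields a given saving differs between, say, attention projections and wider feed-forward layers.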