🔗 Source: arXiv

Higher Layers Need More LoRA Experts

🚀 Technical Novelty

  • Mechanism: Integrates LoRA adapters into a Mixture-of-Experts framework with flexible layer-wise allocation, allowing each Transformer block to use a distinct number of trainable low-rank matrices instead of a fixed count.
  • Nuance: Departs from the uniform expert distribution of prior LoRA-MoE methods. It shows empirically that asymmetric allocation works better, and assigns more experts to higher (deeper) layers, where expert redundancy is lowest and the most abstract features are learned.
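
The layer-wise allocation described above can be sketched as a simple schedule: split the Transformer blocks into consecutive groups and give each group its own expert count. This is a minimal illustration, not the authors' implementation; the function names, the 32-layer depth, the four-group split, the hidden size of 4096, and the rank of 8 are all assumptions chosen for the example.

```python
from typing import List, Sequence


def allocate_experts(num_layers: int, experts_per_group: Sequence[int]) -> List[int]:
    """Return the expert count for every layer, lowest layer first.

    Layers are split into equal consecutive groups (hypothetical scheme);
    every layer in a group receives that group's expert count.
    """
    groups = len(experts_per_group)
    if num_layers <= 0 or groups == 0 or num_layers % groups != 0:
        raise ValueError("num_layers must be a positive multiple of the group count")
    size = num_layers // groups
    return [experts_per_group[i // size] for i in range(num_layers)]


def lora_expert_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA expert: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def total_expert_params(allocation: Sequence[int], d_in: int, d_out: int, rank: int) -> int:
    """Total LoRA expert parameters for one adapted weight matrix per layer."""
    return sum(n * lora_expert_params(d_in, d_out, rank) for n in allocation)


if __name__ == "__main__":
    # Assumed settings for illustration only.
    uniform = allocate_experts(32, [5, 5, 5, 5])      # same count in every layer
    asymmetric = allocate_experts(32, [2, 4, 6, 8])   # more experts in higher layers
    for name, alloc in (("uniform", uniform), ("asymmetric", asymmetric)):
        params = total_expert_params(alloc, d_in=4096, d_out=4096, rank=8)
        print(name, sum(alloc), "experts,", params, "expert parameters")
```

With these assumed numbers both schedules use the same total number of experts, so any difference between them comes from where the experts sit rather than how many there are.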

💡 Yield

  • Matches or exceeds all PEFT baselines across six NLP/commonsense QA benchmarks, and with fewer trainable parameters outperforms uniform per-layer allocation; layer-wise Frobenius norm analysis indicates that experts in lower layers are more redundant, which supports the asymmetric design.
  • Demonstrates strong continual learning capabilities, reducing forgetting of domain knowledge relative to standard LoRA during sequential fine-tuning.
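
One plausible way to quantify the expert redundancy mentioned above is the average Frobenius norm of the difference between every pair of expert weight matrices in a layer: a smaller value means the experts are more alike, hence more redundant. The sketch below is an assumption about how such an analysis could be set up, not the paper's exact procedure; the function names and the toy matrices are hypothetical.

```python
import math
from itertools import combinations
from typing import List

Matrix = List[List[float]]


def frobenius_distance(a: Matrix, b: Matrix) -> float:
    """Frobenius norm of (a - b): sqrt of the sum of squared entry differences."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def mean_pairwise_distance(experts: List[Matrix]) -> float:
    """Average Frobenius distance over all unordered pairs of experts in one layer."""
    pairs = list(combinations(experts, 2))
    if not pairs:
        raise ValueError("need at least two experts")
    return sum(frobenius_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy weights: the "lower" layer's experts are nearly identical,
    # the "higher" layer's experts differ more.
    lower_layer = [[[1.0, 0.0]], [[1.0, 0.1]], [[0.9, 0.0]]]
    higher_layer = [[[1.0, 0.0]], [[0.0, 1.0]], [[-1.0, 0.5]]]
    print("lower layer :", mean_pairwise_distance(lower_layer))
    print("higher layer:", mean_pairwise_distance(higher_layer))
```

In practice the matrices would be each expert's trained LoRA weights (for example the product of its two low-rank factors), compared layer by layer.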

⚠️ Limitations

  • Relies on static, pre-defined expert counts per layer: the allocation is fixed by hand before training rather than learned or adapted dynamically per input during training or inference.
  • Primarily validated on decoder-only LLMs for text-based tasks; multimodal, vision, or cross-architecture generalization remains unexplored.