Layer-Wise LoRA Expert Allocation
🔗 Source: arXiv
Higher Layers Need More LoRA Experts
🚀 Technical Novelty
- Mechanism: Introduces MoLA, a PEFT framework that replaces uniform LoRA-MoE distributions with flexible, layer-wise expert allocation across Transformer blocks.
- Nuance: Departs from prior LoRA-MoE work by showing empirically that experts in lower layers are more similar to one another and therefore redundant, which motivates concentrating experts in higher layers instead of fixing one count for every layer.
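The layer-wise allocation described above can be sketched as a small helper that expands a per-group expert count into one count per Transformer layer. This is an illustrative sketch only: the function name `allocate_experts`, the 32-layer depth, and the split into four equal contiguous groups are assumptions made for the example, not details taken from this summary.

```python
from typing import List, Sequence


def allocate_experts(group_counts: Sequence[int], num_layers: int) -> List[int]:
    """Expand per-group expert counts into one count per Transformer layer.

    Layers are split into len(group_counts) contiguous, equally sized groups,
    ordered from the lowest layer to the highest.
    """
    if not group_counts:
        raise ValueError("group_counts must not be empty")
    if num_layers % len(group_counts) != 0:
        raise ValueError("num_layers must divide evenly into the groups")
    layers_per_group = num_layers // len(group_counts)
    # Repeat each group's count once for every layer in that group.
    return [n for n in group_counts for _ in range(layers_per_group)]


# Hypothetical 32-layer model split into four groups of eight layers.
uniform = allocate_experts([5, 5, 5, 5], 32)
higher_heavy = allocate_experts([2, 4, 6, 8], 32)

# Expert counts at the lowest layer, the first layer of group two, and the top layer.
print(higher_heavy[0], higher_heavy[8], higher_heavy[31])
```

Under these assumptions both layouts hold the same total number of experts; only their placement across depth differs.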
💡 Yield
- Matches or exceeds all PEFT baselines on six NLP and commonsense QA benchmarks while using fewer total parameters.
- Demonstrates that allocating more experts to higher layers (e.g., 2-4-6-8 experts from the lowest to the highest layer group) outperforms uniform configurations by reducing lower-layer redundancy.
- Exhibits strong continual learning capabilities, minimizing forgetting of domain knowledge during sequential fine-tuning across multiple subjects.
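The parameter savings behind these results can be made concrete with a rough count. The sketch below is a back-of-the-envelope estimate, not the paper's accounting: it assumes one adapted square projection per layer, a bias-free linear router, and hypothetical values for the hidden size (`d_model = 4096`), LoRA rank (`rank = 8`), and group size (eight layers per group).

```python
from typing import Sequence


def lora_moe_params(
    group_counts: Sequence[int],
    layers_per_group: int,
    d_model: int,
    rank: int,
) -> int:
    """Count trainable parameters for a layer-wise LoRA-MoE layout.

    Assumes each layer adapts a single d_model x d_model weight, each expert is
    one LoRA pair (A: rank x d_model, B: d_model x rank), and each layer has a
    bias-free linear router with one output per expert.
    """
    total = 0
    for n_experts in group_counts:
        expert_params = n_experts * rank * (d_model + d_model)
        router_params = d_model * n_experts
        total += layers_per_group * (expert_params + router_params)
    return total


# Hypothetical sizes chosen only for illustration.
higher_heavy = lora_moe_params([2, 4, 6, 8], layers_per_group=8, d_model=4096, rank=8)
uniform_8 = lora_moe_params([8, 8, 8, 8], layers_per_group=8, d_model=4096, rank=8)

print(higher_heavy, uniform_8, higher_heavy / uniform_8)
```

Because both the expert and router terms scale linearly with the number of experts, the ratio here depends only on the expert totals (160 versus 256), so the 2-4-6-8 layout carries 62.5% of the trainable parameters of a uniform 8-8-8-8 layout under these assumptions.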
⚠️ Limitations
- Relies on static, pre-defined layer-wise expert configurations rather than fully dynamic, input-aware allocation during training.
- The computational overhead of maintaining a router and multiple LoRA expert pairs per layer is neither optimized nor benchmarked against standard LoRA inference latency.
- Validation is limited to decoder-only LLMs and instruction-tuning benchmarks, leaving encoder-decoder and multimodal extensions unexplored.
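To make the overhead limitation above more tangible, here is a minimal pure-Python sketch of what a single LoRA-MoE layer computes per token on top of the frozen weight. It is an assumption-laden illustration: the function `lora_moe_delta`, the top-k selection, and the softmax renormalization over the selected experts are choices made for this example and may differ from the paper's router.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def lora_moe_delta(
    x: Vector,
    router: Matrix,
    experts: List[Tuple[Matrix, Matrix]],
    top_k: int,
) -> Vector:
    """Return the gated sum of low-rank updates B_i (A_i x) for the top-k experts."""
    logits = matvec(router, x)  # one logit per expert
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalize over chosen experts
    delta = [0.0] * len(experts[0][1])  # output size equals the row count of B
    for gate, i in zip(gates, chosen):
        a, b = experts[i]
        update = matvec(b, matvec(a, x))
        delta = [d + gate * u for d, u in zip(delta, update)]
    return delta


# Toy example: 2-dimensional input, two rank-1 experts, top-1 routing.
x = [1.0, 0.0]
router = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 1.0]], [[2.0], [3.0]]),  # expert 0: A is 1x2, B is 2x1
    ([[0.0, 0.0]], [[0.0], [0.0]]),  # expert 1: contributes nothing
]
delta = lora_moe_delta(x, router, experts, top_k=1)
print(delta)
```

In this sketch each token pays for one router projection plus `top_k` low-rank updates per adapted weight, whereas a single standard LoRA adapter needs no router and can be merged into the base weight before inference.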