Manifold-Constrained Hyper-Connections
🔗 Source: arXiv
mHC: Manifold-Constrained Hyper-Connections
🚀 Technical Novelty
- Mechanism: Projects the learnable residual-stream mixing matrices onto the manifold of doubly stochastic matrices using the Sinkhorn-Knopp algorithm, and pairs this with kernel fusion, activation recomputation, and DualPipe communication overlap to reduce memory-access overhead.
- Nuance: Prior Hyper-Connections (HC) widen the residual stream and leave its mixing matrices unconstrained, which sacrifices stability and increases memory traffic. mHC instead constrains the mixing so that signal is conserved across streams in both the forward and backward passes, and co-designs the low-level infrastructure for compute efficiency.
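To make the projection step concrete, here is a minimal sketch of Sinkhorn-Knopp normalization in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `sinkhorn_knopp`, the 4-stream size, the exponentiation used to make entries positive, and the uniform random scores are all choices made for this example.

```python
import math
import random


def sinkhorn_knopp(scores, n_iters=20):
    """Approximately map a square matrix of raw scores to a doubly stochastic matrix.

    Assumption for this sketch: positivity is obtained by exponentiating the scores.
    """
    n = len(scores)
    # Subtract the largest score before exponentiating for numerical stability.
    peak = max(max(row) for row in scores)
    m = [[math.exp(v - peak) for v in row] for row in scores]
    for _ in range(n_iters):
        # Normalize every row to sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Normalize every column to sum to 1.
        col_totals = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_totals[j] for j in range(n)] for i in range(n)]
    return m


random.seed(0)
n_streams = 4  # hypothetical residual-stream count, chosen for illustration
raw = [[random.uniform(-1.0, 1.0) for _ in range(n_streams)] for _ in range(n_streams)]
mix = sinkhorn_knopp(raw, n_iters=20)

row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[i][j] for i in range(n_streams)) for j in range(n_streams)]
print("max row-sum deviation:", max(abs(s - 1.0) for s in row_sums))
print("max col-sum deviation:", max(abs(s - 1.0) for s in col_sums))
```

Because the loop ends on a column normalization, the column sums are 1 up to floating-point error, while the row sums are only approximately 1 after a finite number of iterations. That residual is the kind of deviation a fixed iteration budget leaves behind.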
💡 Yield
- Achieves consistent downstream improvements (+2.1% BBH, +2.3% DROP) on a 27B model over both baseline and unconstrained HC architectures.
- Reduces the propagation gain by roughly three orders of magnitude relative to HC, keeping the forward signal and the backward gradient bounded as layers are stacked.
- Demonstrates robust compute and token scaling trajectories, maintaining loss advantages up to 27B parameters without training divergence.
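The bounded-gain claim can be illustrated with a toy experiment. The sketch below multiplies 30 mixing matrices and measures the largest absolute row sum (forward gain) and column sum (backward gain) of the product. Everything here is an assumption made for illustration: the helper names, the depth, and the noise ranges are hypothetical; exactly doubly stochastic matrices are built as convex combinations of permutation matrices (the Birkhoff-von Neumann construction) rather than by Sinkhorn-Knopp; and the "unconstrained" matrices are simply identity plus positive noise. The numbers it prints are not the paper's measurements.

```python
import random


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def random_doubly_stochastic(n, rng, n_perms=6):
    """Convex combination of random permutation matrices (rows and columns sum to 1)."""
    weights = [rng.random() + 0.1 for _ in range(n_perms)]
    total = sum(weights)
    m = [[0.0] * n for _ in range(n)]
    for w in weights:
        perm = list(range(n))
        rng.shuffle(perm)
        for i, j in enumerate(perm):
            m[i][j] += w / total
    return m


def random_unconstrained(n, rng):
    """Toy stand-in for an unconstrained mixing matrix: identity plus positive noise."""
    return [[(1.0 if i == j else 0.0) + rng.uniform(0.05, 0.2) for j in range(n)]
            for i in range(n)]


def forward_gain(m):
    """Largest absolute row sum of the matrix."""
    return max(sum(abs(v) for v in row) for row in m)


def backward_gain(m):
    """Largest absolute column sum of the matrix."""
    n = len(m)
    return max(sum(abs(m[i][j]) for i in range(n)) for j in range(n))


def stacked(make, n, depth, rng):
    """Product of `depth` independently drawn n-by-n matrices."""
    prod = make(n, rng)
    for _ in range(depth - 1):
        prod = matmul(make(n, rng), prod)
    return prod


rng = random.Random(0)
n_streams, depth = 4, 30  # hypothetical sizes for the illustration
constrained = stacked(random_doubly_stochastic, n_streams, depth, rng)
unconstrained = stacked(random_unconstrained, n_streams, depth, rng)

print("doubly stochastic:", forward_gain(constrained), backward_gain(constrained))
print("unconstrained:    ", forward_gain(unconstrained), backward_gain(unconstrained))
```

A product of doubly stochastic matrices is itself doubly stochastic, so both gains stay at 1 regardless of depth. In the toy unconstrained case every row and column sums to at least 1.2, so the gains grow at least as fast as 1.2 raised to the depth.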
⚠️ Limitations
- Approximates the doubly stochastic constraint with a fixed 20-iteration Sinkhorn-Knopp routine for speed, so the gradient gain deviates slightly from exactly 1.
- Infrastructure optimizations (kernel fusion, activation recomputation, communication overlap) require custom system-level integration and are not presented as a universal, framework-agnostic drop-in module.
- Empirical validation is focused on LLM pre-training; broader architectural generalization or applicability to non-Transformer modalities remains unexplored.