🔗 Source: arXiv

mHC: Manifold-Constrained Hyper-Connections

🚀 Technical Novelty

  • Mechanism: Projects the learnable residual-stream mixing matrices onto the manifold of doubly stochastic matrices using the Sinkhorn-Knopp algorithm, and pairs this with kernel fusion, activation recomputation, and DualPipe communication overlap to limit memory-access overhead.
  • Nuance: Unlike prior Hyper-Connections (HC), which widen the residual stream with unconstrained mixing matrices (sacrificing stability and increasing memory traffic), mHC enforces forward and backward signal conservation across streams by construction, and co-designs the low-level infrastructure for compute efficiency.
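
The projection step in the Mechanism bullet can be illustrated with a minimal sketch of Sinkhorn-Knopp normalization. This is not the paper's fused kernel: the function name `sinkhorn_knopp`, the 4-stream size, the logit values, and the choice to exponentiate the logits first are all assumptions made for illustration.

```python
import math

def sinkhorn_knopp(logits, num_iters=20):
    """Approximately map a square matrix of logits to a doubly stochastic matrix."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive; subtracting the max avoids overflow.
    peak = max(max(row) for row in logits)
    m = [[math.exp(x - peak) for x in row] for row in logits]
    for _ in range(num_iters):
        # Row step: rescale each row so it sums to 1.
        row_totals = [sum(row) for row in m]
        m = [[m[i][j] / row_totals[i] for j in range(n)] for i in range(n)]
        # Column step: rescale each column so it sums to 1.
        col_totals = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_totals[j] for j in range(n)] for i in range(n)]
    return m

# Hypothetical logits for 4 residual streams (values chosen only for illustration).
logits = [
    [0.5, -0.2, 0.1, 0.0],
    [-0.3, 0.4, 0.0, 0.2],
    [0.1, 0.0, 0.5, -0.4],
    [0.0, 0.3, -0.1, 0.5],
]
mixing = sinkhorn_knopp(logits)
row_sums = [sum(row) for row in mixing]
col_sums = [sum(row[j] for row in mixing) for j in range(len(mixing))]
print("row sums:", [round(s, 6) for s in row_sums])
print("col sums:", [round(s, 6) for s in col_sums])
```

Every entry of the result is positive and both the rows and the columns sum to (approximately) 1, which is the property that makes each output stream a convex combination of the input streams.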

💡 Yield

  • Achieves consistent downstream improvements (+2.1% BBH, +2.3% DROP) on a 27B model over both baseline and unconstrained HC architectures.
  • Reduces the propagation gain magnitude by three orders of magnitude relative to HC, keeping the forward signal and the backward gradient bounded as layers are stacked.
  • Scales robustly with both compute and training tokens, keeping its loss advantage up to 27B parameters without training divergence.
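
The bounded-gain claim can be checked on a toy example. The sketch below assumes that "gain" means the maximum absolute row sum (forward) and maximum absolute column sum (backward) of the product of per-layer mixing matrices; that reading, the layer count, and the "identity plus small positive noise" stand-in for unconstrained mixing are illustrative assumptions, not the paper's setup.

```python
import math
import random

def sinkhorn_knopp(logits, num_iters=20):
    """Approximately map a square matrix of logits to a doubly stochastic matrix."""
    n = len(logits)
    peak = max(max(row) for row in logits)
    m = [[math.exp(x - peak) for x in row] for row in logits]
    for _ in range(num_iters):
        row_totals = [sum(row) for row in m]
        m = [[m[i][j] / row_totals[i] for j in range(n)] for i in range(n)]
        col_totals = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_totals[j] for j in range(n)] for i in range(n)]
    return m

def matmul(a, b):
    """Plain square matrix product a @ b."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def gains(m):
    """Return (max absolute row sum, max absolute column sum) of m."""
    n = len(m)
    forward = max(sum(abs(x) for x in row) for row in m)
    backward = max(sum(abs(m[i][j]) for i in range(n)) for j in range(n))
    return forward, backward

random.seed(0)
n_streams, n_layers = 4, 30  # hypothetical sizes

identity = [[1.0 if i == j else 0.0 for j in range(n_streams)] for i in range(n_streams)]
constrained = identity
unconstrained = identity
for _ in range(n_layers):
    # Constrained layer: random logits pushed through Sinkhorn-Knopp.
    logits = [[random.uniform(-0.5, 0.5) for _ in range(n_streams)] for _ in range(n_streams)]
    constrained = matmul(sinkhorn_knopp(logits), constrained)
    # Unconstrained stand-in: identity plus small positive noise, no normalization.
    noisy = [[identity[i][j] + random.uniform(0.0, 0.1) for j in range(n_streams)]
             for i in range(n_streams)]
    unconstrained = matmul(noisy, unconstrained)

constrained_gain = gains(constrained)
unconstrained_gain = gains(unconstrained)
print("constrained   (forward, backward):", constrained_gain)
print("unconstrained (forward, backward):", unconstrained_gain)
```

Because a product of doubly stochastic matrices is itself doubly stochastic, the constrained gains stay at 1 regardless of depth, while the unnormalized product grows with every layer.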

⚠️ Limitations

  • Approximates the doubly stochastic constraint with a fixed 20-iteration Sinkhorn-Knopp routine for speed, so the gradient gain deviates slightly from exactly 1.
  • Infrastructure optimizations (kernel fusion, activation recomputation, communication overlap) require custom system-level integration and are not presented as a universal, framework-agnostic drop-in module.
  • Empirical validation focuses on LLM pre-training; generalization to non-Transformer architectures or to other modalities remains unexplored.
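
The first limitation can be made concrete with a small experiment on how far a fixed iteration budget leaves the result from exact double stochasticity. The two 2x2 logit matrices below are hypothetical, and the "sharp" one is deliberately adversarial (one strongly suppressed entry); neither comes from the paper. In this sketch the column step runs last, so the residual error shows up in the row sums.

```python
import math

def sinkhorn_knopp(logits, num_iters=20):
    """Approximately map a square matrix of logits to a doubly stochastic matrix."""
    n = len(logits)
    peak = max(max(row) for row in logits)
    m = [[math.exp(x - peak) for x in row] for row in logits]
    for _ in range(num_iters):
        row_totals = [sum(row) for row in m]
        m = [[m[i][j] / row_totals[i] for j in range(n)] for i in range(n)]
        col_totals = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_totals[j] for j in range(n)] for i in range(n)]
    return m

def max_row_deviation(m):
    """Largest |row sum - 1|; zero only for an exactly row-stochastic matrix."""
    return max(abs(sum(row) - 1.0) for row in m)

# Hypothetical inputs: small logits versus one strongly suppressed entry.
mild = [[0.0, 0.3], [0.2, 0.0]]
sharp = [[0.0, 0.0], [-12.0, 0.0]]

for name, logits in (("mild", mild), ("sharp", sharp)):
    for iters in (1, 5, 20, 200):
        dev = max_row_deviation(sinkhorn_knopp(logits, iters))
        print(f"{name:5s} iters={iters:3d}  max |row sum - 1| = {dev:.2e}")

mild_dev_20 = max_row_deviation(sinkhorn_knopp(mild, 20))
sharp_dev_20 = max_row_deviation(sinkhorn_knopp(sharp, 20))
```

With mild logits, 20 iterations are far more than enough; with the sharp matrix a visible residual remains at 20 iterations, which is the kind of small deviation from a gain of exactly 1 that a fixed budget trades for speed.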