Stable Looped Language Models
🔗 Source: arXiv
Parcae: Scaling Laws For Stable Looped Language Models
🚀 Technical Novelty
- Mechanism: Constrains the spectral norm of the injection parameters by discretizing a negative diagonal parameterization, and normalizes the injected input, to prevent residual-stream explosion in recurrent layers.
- Nuance: Replaces fragile hyperparameter tuning and post-norm tricks with a control-theoretic dynamical systems framework, treating looping as an orthogonal compute-scaling axis rather than a fixed-depth approximation.
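The stability idea above can be illustrated with a toy diagonal recurrence. This is a minimal sketch, not the paper's implementation: the zero-order-hold discretization is borrowed from state-space models as a stand-in, and the names `discretize`, `rms_normalize`, and `run_loop`, the softplus reparameterization, and the scalar injection gain are all assumptions made for illustration.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def discretize(a_raw, delta_raw):
    """Map unconstrained scalars to a stable discrete update (assumed form)."""
    a = -softplus(a_raw)          # strictly negative continuous-time rate
    delta = softplus(delta_raw)   # strictly positive step size
    a_bar = math.exp(delta * a)   # decay factor, lies in (0, 1) by construction
    b_bar = (a_bar - 1.0) / a     # zero-order-hold injection gain, positive
    return a_bar, b_bar

def rms_normalize(x, eps=1e-6):
    # Rescale the injected input to roughly unit root-mean-square.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def run_loop(a_raw, delta_raw, e, steps):
    """Iterate h <- A_bar * h + B_bar * e with diagonal A_bar and B_bar."""
    e = rms_normalize(e)
    params = [discretize(a, d) for a, d in zip(a_raw, delta_raw)]
    h = [0.0] * len(e)
    for _ in range(steps):
        h = [a_bar * h_i + b_bar * e_i
             for (a_bar, b_bar), h_i, e_i in zip(params, h, e)]
    return h

if __name__ == "__main__":
    # Arbitrary illustrative values, not taken from the paper.
    h = run_loop([0.0, 1.0, -1.0], [0.0, 0.5, 1.0], [100.0, -50.0, 3.0], steps=1000)
    print(h)
```

Because the state matrix in this sketch is diagonal, its spectral norm is simply the largest `a_bar`, which stays below 1 for any finite raw parameter values. The state therefore converges to a bounded fixed point however many times the loop runs, whereas a freely learned decay factor above 1 would grow geometrically with loop count.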
💡 Yield
- Derives unified scaling laws: as the training FLOP budget grows, looping and data should be scaled up together following power laws, while at test time the benefit of additional loops follows a predictable, saturating exponential decay.
- Achieves up to 6.3% lower validation perplexity than prior looped models and, under fixed budgets, matches the quality of a Transformer with twice as many parameters.
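The two functional shapes named above can be sketched as follows. The forms are generic textbook versions of a power law and a saturating exponential decay; the function names and every coefficient (`coeff`, `exponent`, `floor`, `gap`, `rate`) are hypothetical placeholders, not fitted values from the paper.

```python
import math

def power_law(compute, coeff, exponent):
    # Generic power law: the quantity scales as coeff * compute ** exponent.
    return coeff * compute ** exponent

def saturating_loss(loops, floor, gap, rate):
    # Loss decays exponentially toward an asymptote `floor` as loops increase.
    return floor + gap * math.exp(-rate * loops)

if __name__ == "__main__":
    # Placeholder coefficients chosen only to show the shape of the curve.
    previous = None
    for loops in range(1, 9):
        loss = saturating_loss(loops, floor=2.0, gap=1.0, rate=0.5)
        gain = None if previous is None else previous - loss
        print(loops, round(loss, 4), gain)
        previous = loss
```

Under this assumed form, each extra loop recovers a fixed fraction of the remaining gap to the floor, so the absolute gain per loop shrinks geometrically. For the power law, doubling compute multiplies the scaled quantity by `2 ** exponent` regardless of the starting budget.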
⚠️ Limitations
- Empirical validation is limited to smaller scales (up to 1.3B parameters); large-scale FLOP extrapolation remains unverified.
- Test-time quality gains saturate around the recurrence depth used in training, so additional inference steps beyond that depth yield only marginal improvements.