🔗 Source: arXiv

Training-Free Looped Transformers

🚀 Technical Novelty

  • Mechanism: Wraps a frozen checkpoint’s contiguous mid-stack layers in an iterative loop at inference, modeling each pre-norm block as a forward Euler step on an implicit ODE and refining it via higher-order solvers (e.g., K-stage Runge-Kutta).
  • Nuance: Eliminates the end-to-end training and weight-tied architecture required by prior recurrent transformers; introduces “layer-mode” iteration to stabilize expert routing in MoE models during inference, which block-mode fails to do.
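
The Euler reading of a pre-norm block can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `make_toy_block`, the tanh sublayer, and the choice of the explicit midpoint rule as the higher-order solver are all assumptions made here for demonstration.

```python
import numpy as np

def make_toy_block(dim, seed):
    """Return a stand-in residual branch f(x) for one pre-norm block (hypothetical)."""
    rng = np.random.default_rng(seed)
    weight = rng.normal(scale=0.1, size=(dim, dim))

    def f(x):
        # Pre-norm: normalize the input first, then apply the sublayer.
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        normed = (x - mean) / (std + 1e-6)
        return np.tanh(normed @ weight)

    return f

def euler_step(f, x, h=1.0):
    # A standard pre-norm block, x + f(x), is this update with h = 1.
    return x + h * f(x)

def midpoint_step(f, x, h=1.0):
    # Two-stage Runge-Kutta (explicit midpoint): evaluate f again at a half step.
    half = x + 0.5 * h * f(x)
    return x + h * f(half)

def run_window(blocks, x, step):
    # Apply the chosen step rule to each block of the window, in order.
    for f in blocks:
        x = step(f, x)
    return x

dim = 8
window = [make_toy_block(dim, seed) for seed in range(3)]  # toy "mid-stack" window
x0 = np.random.default_rng(42).normal(size=(2, dim))

baseline = run_window(window, x0, euler_step)    # ordinary forward pass
refined = run_window(window, x0, midpoint_step)  # same weights, extra evaluations
print(baseline.shape, refined.shape)
```

The weights are never updated; only the rule for combining evaluations of `f` changes, which is what makes the approach training-free.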

💡 Yield

  • Delivers consistent +1.14–2.64 pp accuracy gains across 7 model families (dense, MoE, MLA+MoE) on knowledge-heavy benchmarks without per-cell hyperparameter tuning or parameter updates.
  • Demonstrates that naive looping degrades performance by pushing activations out of the trained regime, while ODE-inspired sub-stepping strategies stabilize inference and extract latent reasoning capacity from frozen weights.
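
A toy comparison shows why naive looping can push activations further from their starting point than sub-stepping does. The summary does not spell out the paper's sub-stepping scheme, so the `1/k` step size below is an assumption, as are the random `weight` matrix and the bounded tanh branch `f`.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, K = 16, 4
weight = rng.normal(scale=0.2, size=(dim, dim))  # hypothetical frozen weights

def f(x):
    # Toy residual branch with bounded output: every entry lies in (-1, 1).
    return np.tanh(x @ weight)

def naive_loop(x, k):
    # Re-apply the full-size residual update k times.
    for _ in range(k):
        x = x + f(x)
    return x

def substep_loop(x, k):
    # Take k updates of size 1/k, so the total step length matches one pass.
    for _ in range(k):
        x = x + f(x) / k
    return x

x0 = rng.normal(size=dim)
naive_drift = np.linalg.norm(naive_loop(x0, K) - x0)
substep_drift = np.linalg.norm(substep_loop(x0, K) - x0)
print(f"naive drift: {naive_drift:.2f}, sub-stepped drift: {substep_drift:.2f}")
```

In this toy setting the sub-stepped drift cannot exceed `sqrt(dim)`, whereas the bound for the naive loop grows linearly with `K`.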

⚠️ Limitations

  • Increases inference latency proportionally to the loop count K due to extra forward passes through the window.
  • Provides diminishing returns (3–4× smaller gains) for MLA-based MoE models and shows instability on sub-3B distilled checkpoints for certain knowledge tasks.
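
The latency cost in the first limitation can be estimated with simple arithmetic. The sketch below assumes every layer costs the same and ignores embeddings and the output head; `relative_cost` and its parameter names are hypothetical, and the layer counts in the example are illustrative rather than taken from the paper.

```python
def relative_cost(total_layers, window_layers, loop_count):
    """Approximate forward-pass cost relative to the unlooped model."""
    if not 0 <= window_layers <= total_layers:
        raise ValueError("window_layers must be between 0 and total_layers")
    if loop_count < 1:
        raise ValueError("loop_count must be at least 1")
    # Layers outside the window run once; layers inside run loop_count times.
    outside = total_layers - window_layers
    return (outside + loop_count * window_layers) / total_layers

# Example: a 32-layer model looping an 8-layer window 3 times.
print(relative_cost(total_layers=32, window_layers=8, loop_count=3))  # 1.5
```

Because only the window is repeated, the overhead scales with the window's share of the stack as well as with the loop count.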