🔗 Source: arXiv

Training-Free Looped Transformers

Mechanism: Wraps a contiguous window of mid-stack layers from a frozen checkpoint in an inference-time loop. Each pre-norm block is treated as a forward Euler step of an underlying ODE, and that step is refined with higher-order solvers (e.g., K-stage Runge-Kutta).
Nuance: Eliminates the end-to-end training and weight-tied architecture required by prior recurrent transformers; introduces “layer-mode” iteration to stabilize expert routing in MoE models during inference, which block-mode fails to do.
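
The Euler reading of a pre-norm block can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_branch` is a hypothetical stand-in for a frozen block's residual branch, the sketch covers a single block rather than a window of layers, and Heun's method is used only as a familiar two-stage Runge-Kutta scheme (the summary does not say which scheme the authors use).

```python
import math
from typing import Callable, List

Vector = List[float]
Branch = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_branch(x: Vector) -> Vector:
    """Hypothetical stand-in for a frozen block's residual branch f(LN(x))."""
    return [math.tanh(v) for v in layer_norm(x)]


def euler_step(f: Branch, x: Vector, h: float = 1.0) -> Vector:
    """x + h * f(x). With h = 1 this is the ordinary pre-norm residual update."""
    return [xi + h * fi for xi, fi in zip(x, f(x))]


def naive_loop(f: Branch, x: Vector, k: int) -> Vector:
    """Re-apply the full-size step k times, so the total step size grows to k."""
    for _ in range(k):
        x = euler_step(f, x)
    return x


def substepped(f: Branch, x: Vector, k: int) -> Vector:
    """Take k Euler sub-steps of size 1/k, so the total step size stays 1."""
    for _ in range(k):
        x = euler_step(f, x, h=1.0 / k)
    return x


def heun_step(f: Branch, x: Vector, h: float = 1.0) -> Vector:
    """Two-stage Runge-Kutta (Heun): average the slope at x and at the Euler prediction."""
    k1 = f(x)
    k2 = f([xi + h * ki for xi, ki in zip(x, k1)])
    return [xi + 0.5 * h * (a + b) for xi, a, b in zip(x, k1, k2)]


if __name__ == "__main__":
    x0 = [0.5, -1.0, 2.0, 0.0]
    print("naive x4:  ", naive_loop(toy_branch, x0, 4))
    print("substep x4:", substepped(toy_branch, x0, 4))
    print("heun:      ", heun_step(toy_branch, x0))
```

The contrast to look at is between `naive_loop`, which applies the residual update at full size on every pass, and `substepped` or `heun_step`, which spend the extra evaluations of the same frozen function on a more accurate step of unchanged total size.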

Delivers consistent +1.14–2.64 pp accuracy gains across 7 model families (dense, MoE, MLA+MoE) on knowledge-heavy benchmarks without per-cell hyperparameter tuning or parameter updates.
Demonstrates that naive looping degrades performance by pushing activations out of the trained regime, while ODE-inspired sub-stepping strategies stabilize inference and extract latent reasoning capacity from frozen weights.

Adds inference latency in proportion to the loop count K, since each loop is an extra forward pass through the window.
Yields 3–4× smaller gains for MLA-based MoE models and is unstable on sub-3B distilled checkpoints for certain knowledge tasks.
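
The latency cost noted above can be estimated with simple arithmetic. The sketch below is a rough model under assumptions that are not from the paper: every layer costs the same, K counts total passes through the window (so K = 1 is the unmodified model), and effects such as caching or batching are ignored. The function name and the 32-layer, 8-layer-window figures are hypothetical.

```python
def relative_cost(total_layers: int, window_layers: int, loops: int) -> float:
    """Layer evaluations with looping, divided by layer evaluations without it.

    Assumes uniform per-layer cost; loops=1 means the window is run once (baseline).
    """
    if total_layers <= 0 or not 0 <= window_layers <= total_layers:
        raise ValueError("need 0 <= window_layers <= total_layers and total_layers > 0")
    if loops < 1:
        raise ValueError("loops must be at least 1")
    extra_evaluations = (loops - 1) * window_layers
    return (total_layers + extra_evaluations) / total_layers


if __name__ == "__main__":
    # Hypothetical 32-layer model with an 8-layer looped window.
    for k in (1, 2, 3, 4):
        print(k, relative_cost(total_layers=32, window_layers=8, loops=k))
```

Under this model the overhead is linear in K but scaled by the window's share of the stack: three passes through an 8-layer window of a 32-layer model cost 1.5× the baseline, not 3×.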