🔗 Source: arXiv

REASONING WITH LATENT THOUGHTS: ON THE POWER OF LOOPED TRANSFORMERS

Mechanism: Introduces a weight-sharing looping mechanism in which a k-layer transformer block is applied L times in sequence, giving an effective depth of kL with only the parameters of a k-layer model. Also proposes a layer-similarity regularization that strengthens this inductive bias during training.
Nuance: Decouples architectural depth from parameter count, whereas standard scaling laws tie depth to model size. Unlike Chain-of-Thought, which generates additional tokens at inference, looped models iterate internally, reusing shared weights within the forward pass.
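The looping idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: `make_layer` is a hypothetical stand-in for a transformer layer (a toy residual update on a list of floats), and `params_per_layer` is an assumed per-layer parameter count used only for the arithmetic.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for one transformer layer: a toy residual update."""
    def layer(x: Vector) -> Vector:
        return [v + scale * v + shift for v in x]
    return layer


def apply_block(block: Sequence[Layer], x: Vector) -> Vector:
    """Run the layers of the block once, in order."""
    for layer in block:
        x = layer(x)
    return x


def looped_forward(block: Sequence[Layer], x: Vector, num_loops: int) -> Vector:
    """Apply the same k-layer block num_loops times; the layers are shared."""
    for _ in range(num_loops):
        x = apply_block(block, x)
    return x


def depth_and_params(k: int, num_loops: int, params_per_layer: int) -> Dict[str, int]:
    """Effective depth grows with the loop count; the parameter count does not."""
    return {
        "effective_depth": k * num_loops,
        "looped_params": k * params_per_layer,
        "unrolled_params": k * num_loops * params_per_layer,
    }


block = [make_layer(0.5, 1.0), make_layer(-0.25, 0.0)]  # k = 2 shared layers
out = looped_forward(block, [1.0, 2.0], num_loops=3)    # effective depth 2 * 3
stats = depth_and_params(k=2, num_loops=3, params_per_layer=1000)
```

Here the looped model performs the same computation as a 6-layer stack built by repeating the block, while storing only 2 layers of parameters, which is the L× parameter saving relative to an unshared model of the same depth.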

A k-layer block looped L times matches or exceeds iso-FLOP non-looped (kL-layer) baselines on synthetic and downstream reasoning tasks despite using L× fewer parameters.
Theoretical results show that looped transformers can simulate iterative algorithms, solve p-hop induction with O(log p) loops, and simulate multi-step Chain-of-Thought reasoning via masking and token decoding.
Downstream accuracy scales logarithmically with effective depth (kL), suggesting a strong inductive bias toward compositional reasoning over memorization.
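One way to build intuition for the O(log p) bound on p-hop is pointer doubling: if each round composes the current jump map with itself, the reachable hop distance doubles per round. The sketch below is an analogy under simplifying assumptions, not the paper's transformer construction: the hop relation is given directly as an index list `nxt` (a hypothetical name), and each "loop" is one round of binary exponentiation.

```python
from typing import List, Tuple


def compose(f: List[int], g: List[int]) -> List[int]:
    """Return the map i -> f[g[i]] for maps stored as index lists."""
    return [f[j] for j in g]


def p_hop_by_doubling(nxt: List[int], p: int) -> Tuple[List[int], int]:
    """Compute the p-fold composition of nxt and count the doubling rounds."""
    result = list(range(len(nxt)))  # identity map: zero hops taken so far
    jump = list(nxt)                # after r rounds, jump is nxt applied 2**r times
    rounds = 0
    while p > 0:
        if p & 1:                   # this power of two is part of p
            result = compose(jump, result)
        jump = compose(jump, jump)  # double the hop distance
        p >>= 1
        rounds += 1
    return result, rounds


def p_hop_naive(nxt: List[int], start: int, p: int) -> int:
    """Reference: follow the pointer p times, one hop per step."""
    for _ in range(p):
        start = nxt[start]
    return start


nxt = [2, 0, 3, 1, 4, 2]            # toy hop relation over 6 positions
hops, rounds = p_hop_by_doubling(nxt, 13)
# rounds == 4 here, versus 13 sequential hops in p_hop_naive
```

The number of rounds equals the bit length of p, so it grows as log p rather than p, which mirrors why a small number of loops can suffice for a deep hop chain.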

Evaluated primarily on synthetic procedural tasks; generalization to complex real-world, multimodal, or common-sense reasoning remains unverified.
Suffers from higher perplexity than iso-FLOP non-looped baselines, reflecting the inherent trade-off between depth and model capacity: the looped model has L× fewer parameters.
Lacks a formalized, unified definition of “reasoning,” limiting direct comparison across diverse cognitive benchmarks and leaving broader architectural implications open for future work.