🔗 Source: arXiv

Next-Latent Prediction Transformers Learn Compact World Models

🚀 Technical Novelty

  • Mechanism: Extends standard next-token prediction with an auxiliary self-supervised objective that trains a lightweight dynamics model to predict the transformer’s next latent state from the current latent state and the next token.
  • Nuance: Unlike hybrid architectures or token-space methods, NextLat leaves the transformer architecture intact, using only an outer-loop auxiliary loss to enforce temporal consistency during fully parallel training without sequential computation overhead.
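
A minimal sketch of the auxiliary next-latent objective described above, in plain Python. The names `mlp_dynamics`, `next_latent_loss`, `W1`, and `W2` are hypothetical, and the squared-error loss is an assumption: the summary only says that stop-gradients, KL distillation, and the loss function were design choices. The point it illustrates is that every (current latent, next token) pair and its target latent come from a single transformer forward pass, so the loss terms are independent of one another.

```python
import math

def mlp_dynamics(h, e, W1, W2):
    """Hypothetical one-hidden-layer MLP: (h_t, e_{t+1}) -> predicted h_{t+1}."""
    x = h + e  # concatenate the latent and the token embedding
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hi for w, hi in zip(row, hidden)) for row in W2]

def next_latent_loss(latents, token_embs, W1, W2):
    """Mean squared error between predicted and actual next latents.

    latents[t] is the transformer latent after reading token t and
    token_embs[t] is the embedding of token t. No term depends on
    another, so a real framework would batch them in parallel.
    """
    if len(latents) < 2:
        raise ValueError("need at least two positions")
    total, count = 0.0, 0
    for t in range(len(latents) - 1):
        pred = mlp_dynamics(latents[t], token_embs[t + 1], W1, W2)
        # Assumption: in an autograd framework the target would likely be
        # wrapped in a stop-gradient; plain Python has no gradients.
        target = latents[t + 1]
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
        count += len(target)
    return total / count
```

In training, this term would be added to the usual next-token cross-entropy with some weighting coefficient; the summary does not give that coefficient.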

💡 Yield

  • Proves theoretically that, under the proposed objective, latent representations converge to sufficient-statistic belief states for future prediction.
  • Empirically demonstrates superior length generalization on NC1-complete state-tracking tasks (A5 word problem) compared to standard GPT baselines, with a co-trained RNN extrapolating far beyond the transformer’s expressivity limits.
  • Enables variable-length self-speculative decoding, accelerating language modeling inference by up to 3.3× without requiring separate draft models or multi-token prediction heads.
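
The self-speculative decoding idea can be illustrated with a toy sketch. It assumes the latent dynamics model is rolled forward to draft tokens and the full model then checks them with standard greedy verification; the summary does not spell out the drafting or acceptance rule, so `draft_tokens`, `verify`, `max_draft`, and the toy models below are all hypothetical.

```python
def draft_tokens(h, step_latent, readout, max_draft):
    """Propose up to max_draft tokens by rolling the latent forward."""
    draft = []
    for _ in range(max_draft):
        tok = readout(h)         # greedy token read from the current latent
        draft.append(tok)
        h = step_latent(h, tok)  # cheap latent update, no transformer call
    return draft

def verify(prefix, draft, target_next_token):
    """Keep the longest draft prefix the full model agrees with.

    At the first mismatch the full model's own token is emitted instead.
    A real system would score all draft positions in one parallel pass;
    the loop here is only for readability.
    """
    accepted = []
    for tok in draft:
        expected = target_next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    return accepted

# Toy setup: the "full model" counts upward modulo 5; the drafter agrees
# except when its latent equals 2, where it wrongly proposes 0.
def toy_target(tokens):
    return (tokens[-1] + 1) % 5

def toy_readout(h):
    return 0 if h == 2 else (h + 1) % 5

def toy_step(h, tok):
    return tok

draft = draft_tokens(0, toy_step, toy_readout, max_draft=4)
accepted = verify([0], draft, toy_target)
print(draft, accepted)  # [1, 2, 0, 1] [1, 2, 3]
```

Because the number of accepted tokens depends on where the draft first diverges, each verification step can advance the sequence by a variable number of positions.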

⚠️ Limitations

  • Experiments restricted to simple MLP latent dynamics models; broader architectural search for the dynamics model remains unexplored.
  • Objective design (stop-gradients, KL distillation, loss functions) relies on small-scale ablations; necessity of specific components at larger model/data scales is unverified.
  • Computational constraints prevented evaluation of hidden layer width effects in the dynamics bottleneck or full scaling studies across diverse domains.