🔗 Source: arXiv

NITP: Next Implicit Token Prediction for LLM Pre-training

🚀 Technical Novelty

  • Mechanism: Trains the model to predict temporally shifted shallow-layer representations, used as dense self-supervised targets, via a cosine-similarity loss applied alongside the standard next-token prediction (NTP) objective.
  • Nuance: Unlike prior multi-token or distillation methods that operate in discrete token space or align static layers, NITP enforces autoregressive latent-space geometry to explicitly counteract anisotropic representation collapse.
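
A minimal sketch of the auxiliary objective described above, not the paper's implementation. It assumes a single sequence of shape (T, d), a shift of one position, no projection head, and targets treated as constants. The names `implicit_token_loss`, `total_loss`, and `aux_weight` are hypothetical, and the default weight of 0.1 is an illustrative value rather than one taken from the paper.

```python
import numpy as np

def implicit_token_loss(pred, shallow, shift=1, eps=1e-8):
    """Mean (1 - cosine similarity) between pred[t] and shallow[t + shift].

    pred:    (T, d) predictions made from the model's later-layer states
    shallow: (T, d) shallow-layer representations used as targets
    """
    if shift < 1:
        raise ValueError("shift must be at least 1")
    pred = np.asarray(pred, dtype=np.float64)
    shallow = np.asarray(shallow, dtype=np.float64)
    # Align the prediction at position t with the target at position t + shift.
    p = pred[:-shift]
    z = shallow[shift:]
    # Per-position cosine similarity, guarded against zero-norm vectors.
    dot = np.sum(p * z, axis=-1)
    norms = np.linalg.norm(p, axis=-1) * np.linalg.norm(z, axis=-1)
    cos = dot / np.maximum(norms, eps)
    return float(np.mean(1.0 - cos))

def total_loss(ntp_loss, aux_loss, aux_weight=0.1):
    """Standard NTP loss plus the weighted latent-space term (weight is assumed)."""
    return ntp_loss + aux_weight * aux_loss
```

Because the extra term is used only during training, a setup like this is consistent with the claim of zero inference overhead: at inference time the model simply runs without computing `implicit_token_loss`.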

💡 Yield

  • In theory, regularizes the optimization landscape by pinning down degrees of freedom that are otherwise under-constrained; empirically, delivers consistent downstream gains (e.g., +5.7% on MMLU-Pro for a 9B MoE model) with only ~2% additional training FLOPs and zero inference overhead.

⚠️ Limitations

  • Validated primarily on models up to 9B parameters; reliance on shallow-layer semantic richness may require architectural or corpus-specific tuning for broader generalization.