Next Implicit Token Prediction
🔗 Source: arXiv
NITP: Next Implicit Token Prediction for LLM Pre-training
🚀 Technical Novelty
- Mechanism: Adds an auxiliary objective alongside standard next-token prediction (NTP): the model predicts temporally shifted shallow-layer representations, which serve as dense, self-supervised targets, under a cosine-similarity loss.
- Nuance: Unlike prior multi-token prediction or distillation methods, which operate in discrete token space or align static layers, NITP enforces autoregressive latent-space geometry to explicitly counteract anisotropic representation collapse.
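
The mechanism above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes, for illustration only, that the deep-layer state at position t is compared with the shallow-layer state at position t+1, that both have the same width, and that the auxiliary term is added to the NTP loss with a fixed weight. The names `cosine_alignment_loss`, `nitp_style_loss`, and `weight` are hypothetical, and details such as projection heads or stop-gradients on the targets are left out.

```python
import numpy as np


def cosine_alignment_loss(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) over matching rows of pred and target."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    tgt_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cos = np.sum(pred_n * tgt_n, axis=-1)
    return float(np.mean(1.0 - cos))


def nitp_style_loss(deep_h, shallow_h, ntp_loss, weight=0.1):
    """Combine an NTP loss with a shifted latent-alignment term.

    deep_h, shallow_h: arrays of shape (seq_len, d_model).
    ntp_loss: scalar cross-entropy computed elsewhere (assumed given).
    weight: hypothetical mixing coefficient, not taken from the paper.
    """
    pred = deep_h[:-1]       # deep states at positions 0 .. T-2
    target = shallow_h[1:]   # shallow states at positions 1 .. T-1 (shifted by one)
    aux = cosine_alignment_loss(pred, target)
    return ntp_loss + weight * aux, aux


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shallow = rng.normal(size=(6, 8))
    deep = rng.normal(size=(6, 8))
    total, aux = nitp_style_loss(deep, shallow, ntp_loss=2.0)
    print(f"auxiliary term: {aux:.4f}, combined loss: {total:.4f}")
```

Because the auxiliary term only touches hidden states that the forward pass already computes, a design like this is consistent with the card's claim of zero inference overhead: the term is simply dropped once training ends.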
💡 Yield
- In theory, NITP regularizes the optimization landscape by constraining otherwise under-constrained degrees of freedom. Empirically, it delivers consistent downstream gains (e.g., +5.7% on MMLU-Pro for a 9B MoE model) with only ~2% additional training FLOPs and zero inference overhead.
⚠️ Limitations
- Validated primarily on models up to 9B parameters. Because the method relies on shallow layers being semantically rich, it may need architecture- or corpus-specific tuning to generalize more broadly.