🔗 Source: arXiv

Next-Embedding Prediction Makes Strong Vision Learners

Mechanism: Trains a Vision Transformer autoregressively to predict the next patch embedding in continuous space using causal masking and stop-gradient, directly mirroring next-token prediction in LLMs.
Nuance: Eliminates the need for pixel decoders, discrete tokenizers, contrastive pairs, momentum encoders, or task-specific heads. The model is trained on a single predictive objective over continuous embeddings, performing the prediction task directly rather than being trained to output features for downstream use.
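
The mechanism above can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the function names are hypothetical, and the negative cosine similarity loss is an assumption chosen for illustration, since the summary only states that the objective is predictive and uses a stop-gradient. Plain Python has no autodiff, so the stop-gradient appears only as a comment marking where the targets would be detached.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cosine(u, v):
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def next_embedding_loss(predictions, embeddings):
    """Mean negative cosine similarity between prediction t and embedding t+1.

    predictions[t] is the model output at position t, which under the causal
    mask has only seen patches 0..t. embeddings[t+1] is the target; in an
    autodiff framework it would be wrapped in a stop-gradient (detached) so
    that gradients flow only through the prediction path.
    (Assumption: cosine loss is used here purely for illustration.)
    """
    targets = embeddings[1:]    # shift by one: the target for t is patch t+1
    preds = predictions[:-1]    # the last position has no next patch
    losses = [-cosine(p, t) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)

# Toy usage: three 2-D patch embeddings and predictions that match the next patch.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perfect = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # final entry is unused
loss = next_embedding_loss(perfect, embeddings)  # approximately -1.0
```

The shift by one position is what mirrors next-token prediction: each output is scored only against the embedding of the patch that follows it, and the causal mask prevents that patch from being visible when the output is computed.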

Achieves 83.8% (ViT-B) and 85.3% (ViT-L) top-1 accuracy on ImageNet-1K after standard fine-tuning.
Transfers effectively to dense prediction tasks, delivering strong semantic segmentation performance on ADE20K.
Shows that generative pretraining principles from NLP can be applied directly to vision via continuous embedding prediction, yielding a scalable, architecturally simple self-supervised paradigm.

Performs poorly under standard linear probing because the final autoregressive output remains shallow and closely tied to the initial patch embeddings.
Struggles with complex physical reasoning tasks such as interpreting reflections, shading, shadows, and dense overlapping objects, likely due to ImageNet-1K dataset limitations rather than architectural flaws.
Current validation is limited to ImageNet-1K; scalability and robustness on more diverse, large-scale vision datasets remain untested.
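
For readers unfamiliar with the linear-probing protocol mentioned in the limitations, the following toy sketch shows what it involves: the backbone is frozen, its output features are treated as fixed inputs, and only a linear classifier is trained on top. This is a hedged illustration on made-up 2-D features with a binary label, not the paper's evaluation setup; the function names and data are hypothetical.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on frozen features with gradient descent.

    Only the probe weights w and bias b are updated; the features stand in for
    the output of a frozen backbone and are never modified.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            err = p - y                      # gradient of the log loss w.r.t. z
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def probe_accuracy(w, b, features, labels):
    """Fraction of examples the linear probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(features)

# Hypothetical frozen features: linearly separable along the first dimension.
frozen_features = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, -0.1]]
labels = [1, 1, 0, 0]
w, b = train_linear_probe(frozen_features, labels)
accuracy = probe_accuracy(w, b, frozen_features, labels)
```

Because the probe can only draw linear boundaries in the frozen feature space, its accuracy depends entirely on how linearly separable those features already are, which is why a model can fine-tune well yet score poorly under this protocol.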