🔗 Source: arXiv

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

🚀 Technical Novelty

  • Mechanism: Jointly optimizes an image encoder and temporal predictor end-to-end using a mean-squared prediction loss plus a SIGReg regularizer that enforces isotropic Gaussian latent distributions to prevent representation collapse.
  • Nuance: Unlike existing JEPAs that depend on stop-gradients, exponential moving averages, frozen pre-trained encoders, or complex multi-objective losses to stabilize training, LeWorldModel (LeWM) achieves provable anti-collapse with a single tunable hyperparameter and none of these training heuristics.
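
The two-term objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the regularizer below is a simplified moment-matching stand-in for SIGReg (it only checks that random 1D projections of the latents have zero mean and unit variance), and the names `prediction_loss`, `gaussian_moment_penalty`, `total_loss`, and `reg_weight` are hypothetical.

```python
import numpy as np

def prediction_loss(z_pred, z_next):
    """Mean-squared error between predicted and encoded next-step latents."""
    return float(np.mean((z_pred - z_next) ** 2))

def gaussian_moment_penalty(z, num_directions=64, seed=0):
    """Simplified stand-in for SIGReg (an assumption, NOT the paper's statistic).

    Projects latents of shape (batch, dim) onto random unit directions and
    penalizes each 1D projection for deviating from zero mean and unit variance.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(z.shape[1], num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit-norm columns
    proj = z @ dirs                                      # (batch, num_directions)
    mean_term = proj.mean(axis=0) ** 2
    var_term = (proj.var(axis=0) - 1.0) ** 2
    return float(np.mean(mean_term + var_term))

def total_loss(z_pred, z_next, z_all, reg_weight=0.1):
    """Prediction loss plus a weighted regularizer; reg_weight is the single knob."""
    return prediction_loss(z_pred, z_next) + reg_weight * gaussian_moment_penalty(z_all)
```

A fully collapsed encoder (all latents identical) drives the prediction term to zero but pays the full regularizer penalty, which is the intuition behind using a distribution-matching term instead of stop-gradients or EMA targets.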

💡 Yield

  • With only 15M parameters, trains stably on a single GPU in hours and plans up to 48× faster than foundation-model-based world models under fixed compute.
  • Achieves competitive control performance across diverse 2D and 3D environments, with latent spaces that encode meaningful physical structure and reliably detect physically implausible events.
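
The summary does not say how implausible events are detected. One common approach, assumed here purely for illustration, is to treat the latent prediction error as a surprise signal and flag time steps whose error exceeds a threshold calibrated on ordinary rollouts. The function names and the quantile-based threshold are hypothetical.

```python
import numpy as np

def surprise_scores(z_pred, z_obs):
    """Per-step squared latent prediction error for arrays of shape (T, dim)."""
    return np.sum((z_pred - z_obs) ** 2, axis=-1)

def flag_implausible(scores, calibration_scores, quantile=0.99):
    """Flag steps whose surprise exceeds a quantile of scores from normal rollouts."""
    threshold = np.quantile(calibration_scores, quantile)
    return scores > threshold
```

In use, `z_pred` would come from the predictor and `z_obs` from the encoder applied to the frames actually observed; a spike in the score marks the step where the observation departs from what the model expected.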

⚠️ Limitations

  • Planning horizons remain short, requiring hierarchical extensions for long-horizon reasoning.
  • Depends on offline datasets with sufficient interaction coverage, which can be costly or difficult to collect.
  • SIGReg may struggle in environments with low intrinsic dimensionality, where matching an isotropic Gaussian prior is difficult.
  • Requires explicit action labels for prediction, limiting applicability when action annotations are unavailable.
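
The low-intrinsic-dimensionality limitation can be illustrated with a toy example. The sketch below assumes latents that lie in a linear 2-dimensional subspace of an 8-dimensional latent space; real encoders are nonlinear, so this is only an intuition aid and not an experiment from the source. An isotropic Gaussian target would need all eight covariance eigenvalues to be close to 1, while rank-deficient latents leave six of them at zero.

```python
import numpy as np

def covariance_spectrum(z):
    """Eigenvalues of the sample covariance of z (batch, dim), sorted descending."""
    cov = np.cov(z, rowvar=False)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

rng = np.random.default_rng(0)
intrinsic_dim, latent_dim = 2, 8                      # hypothetical sizes

factors = rng.normal(size=(2048, intrinsic_dim))      # true degrees of freedom
mixing = rng.normal(size=(intrinsic_dim, latent_dim)) # linear embedding into latent space
z_low = factors @ mixing                              # latents confined to a 2D subspace

spectrum = covariance_spectrum(z_low)
num_active = int(np.sum(spectrum > 1e-8))             # directions with non-zero variance
```

Here `num_active` equals the intrinsic dimension, so any regularizer that asks for unit variance in every direction is in tension with what the data can support.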