🔗 Source: arXiv

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Mechanism: Jointly optimizes an image encoder and temporal predictor end-to-end using a mean-squared prediction loss plus a SIGReg regularizer that enforces isotropic Gaussian latent distributions to prevent representation collapse.
Nuance: Existing JEPAs depend on stop-gradients, exponential moving averages, frozen pre-trained encoders, or complex multi-objective losses to stabilize training. LeWorldModel (LeWM) instead achieves provable anti-collapse with a single tunable hyperparameter and no training heuristics.
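
Sketch: the two-term objective described above can be illustrated roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation. The Gaussian penalty is a simplified moment-matching stand-in for SIGReg (it only checks the mean and variance of random 1-D projections, whereas the actual SIGReg statistic is defined in the paper). The names `prediction_loss`, `gaussian_penalty`, `total_loss`, `num_directions`, and `lam` are hypothetical, and treating `lam` as the single tunable hyperparameter is an assumption.

```python
import numpy as np

def prediction_loss(z_pred, z_next):
    """Mean-squared error between predicted and encoded next-step latents."""
    return float(np.mean((z_pred - z_next) ** 2))

def gaussian_penalty(z, num_directions=64, rng=None):
    """Simplified stand-in for SIGReg (NOT the paper's statistic).

    Projects latents of shape (batch, dim) onto random unit directions and
    penalizes each 1-D projection for deviating from zero mean, unit variance.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    dirs = rng.standard_normal((z.shape[1], num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit-norm columns
    proj = z @ dirs                                       # (batch, num_directions)
    mean_term = proj.mean(axis=0) ** 2
    var_term = (proj.var(axis=0) - 1.0) ** 2
    return float(np.mean(mean_term + var_term))

def total_loss(z_pred, z_next, z_all, lam=0.1):
    """Prediction loss plus weighted regularizer; lam is a hypothetical weight."""
    return prediction_loss(z_pred, z_next) + lam * gaussian_penalty(z_all)

# A fully collapsed batch (all latents identical) has zero variance along
# every direction, so the stand-in penalty is exactly 1.0.
collapsed = np.zeros((32, 8))
print(gaussian_penalty(collapsed))
```

In this toy version, a collapsed encoder can drive the prediction term to zero but still pays the full penalty, which is the intuition behind pairing the two terms. In real training, both terms would be computed with an autodiff framework so gradients flow through the encoder and predictor jointly.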

Trains stably on a single GPU in hours with only 15M parameters while planning up to 48× faster than foundation-model-based world models under fixed compute.
Achieves competitive control performance across diverse 2D and 3D environments, with latent spaces that encode meaningful physical structure and reliably detect physically implausible events.
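
Sketch: the summary does not say how implausible events are detected. One common approach, assumed here purely for illustration, is to score each step by the latent prediction error ("surprise") and flag steps that exceed a threshold calibrated on ordinary rollouts. The function names and the `quantile` parameter are hypothetical.

```python
import numpy as np

def surprise_scores(z_pred, z_obs):
    """Per-step squared error between predicted and observed latents.

    Both inputs have shape (T, dim); the result has shape (T,).
    """
    return np.sum((z_pred - z_obs) ** 2, axis=-1)

def flag_implausible(scores, reference_scores, quantile=0.99):
    """Flag steps whose surprise exceeds a quantile of reference scores.

    reference_scores are surprise values collected on physically ordinary
    rollouts (an assumed calibration set).
    """
    threshold = np.quantile(reference_scores, quantile)
    return scores > threshold
```

A fixed quantile threshold is only one possible design choice; the right calibration would depend on the environment and on how noisy the predictor is.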

Planning horizons remain short, requiring hierarchical extensions for long-horizon reasoning.
Depends on offline datasets with sufficient interaction coverage, which can be costly or difficult to collect.
SIGReg may struggle in environments with low intrinsic dimensionality, where matching an isotropic Gaussian prior is difficult.
Requires explicit action labels for prediction, limiting applicability when action annotations are unavailable.