VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Published: 25 May 2026
Channel: Josef Albers

This video introduces VL-JEPA, a novel vision-language model based on a Joint Embedding Predictive Architecture that prioritizes efficiency and semantic depth. Unlike traditional models that generate text token-by-token, VL-JEPA operates in a continuous latent space, predicting target embeddings to focus on meaning while ignoring superficial linguistic variations. This architecture allows the model to outperform standard generative systems while using 50% fewer trainable parameters, demonstrating superior sample efficiency. The system natively supports selective decoding, a feature that drastically reduces computational costs for real-time video applications by updating text only when significant semantic shifts occur. Beyond captioning, its unified design excels at open-vocabulary classification and text-to-video retrieval, surpassing established models like CLIP. Ultimately, VL-JEPA establishes a more responsive and efficient foundation for machine intelligence to understand and interact with the physical world.
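To make two of the ideas above concrete, here is a minimal sketch of selective decoding and open-vocabulary classification over embeddings. It is an illustration under stated assumptions, not the method from the paper or the linked repository: the 3-dimensional vectors are toy stand-ins for predicted embeddings, the fixed cosine-similarity threshold is an assumed trigger rule, and the names `classify`, `selective_decode`, `LABELS`, and `STREAM` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(embedding, label_embeddings):
    """Open-vocabulary classification: return the label whose embedding
    is most similar to the predicted embedding."""
    return max(label_embeddings,
               key=lambda name: cosine(embedding, label_embeddings[name]))


def selective_decode(stream, decode, threshold=0.9):
    """Run the (expensive) decode step only when the current embedding has
    drifted from the last decoded one. The threshold rule is an assumption
    made for this sketch."""
    outputs = []
    anchor = None  # embedding at the most recent decode
    for t, emb in enumerate(stream):
        if anchor is None or cosine(emb, anchor) < threshold:
            anchor = emb
            outputs.append((t, decode(emb)))
    return outputs


# Toy data (hypothetical): label embeddings and a per-frame embedding stream.
LABELS = {
    "pouring water": [1.0, 0.0, 0.0],
    "cutting bread": [0.0, 1.0, 0.0],
}
STREAM = [
    [1.0, 0.05, 0.0],
    [0.98, 0.1, 0.0],
    [0.95, 0.02, 0.0],
    [0.1, 1.0, 0.0],   # semantic shift happens here
    [0.05, 0.97, 0.0],
]

events = selective_decode(STREAM, lambda e: classify(e, LABELS))
print(events)  # [(0, 'pouring water'), (3, 'cutting bread')]
```

In this toy stream the decode step runs on 2 of 5 frames, which is the intuition behind the cost savings described above. A real system would use a learned text decoder in place of the nearest-label lookup.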

Unofficial implementation is available at: https://github.com/JosefAlbers/VL-JEPA
Short: VL-JEPA(JOINT EMBEDDING PREDICTIVE ARCHITE...
Paper: https://arxiv.org/abs/2512.10942