V-JEPA extends the Joint-Embedding Predictive Architecture (JEPA) principle from images to video, training a visual encoder by predicting masked spatio-temporal regions of a video within a learned ...
A recipe to learn generalist robot policies from large-scale human and robot videos without action labels. A novel approach to extract motion-centric latent actions that capture fine-grained physical ...