At Meta, artificial intelligence is no longer content with writing poems or sorting images. With V-JEPA 2, the company wants to go further: to help machines understand the world the way we do every day, by observing it. This new version of the V-JEPA model can predict what will happen in a scene, anticipate movements, or even plan actions in an unknown environment, like a robot that can guess that an egg cooked in a pan is supposed to end up on a plate.
An AI that learns like a child (or almost)
Meta's ambition is to develop what the company calls "world models": AI systems capable of mentally simulating the consequences of an action before performing it. "We believe these models will usher in a new era for robotic agents, capable of interacting in the real world without requiring massive amounts of training data," explains Yann LeCun, Meta's chief AI scientist.

To acquire this form of common sense, V-JEPA 2 was trained at a very large scale: more than a million hours of video, with no human commentary or annotations, were used to build its first level of understanding. The model is based on an architecture called JEPA (Joint Embedding Predictive Architecture), which separates the encoding of a situation (from video) from the prediction of what will happen next. The system learns to anticipate an action before it takes place: on the Epic-Kitchens dataset, for example, it can guess what a person will do in their kitchen one second later. Better still, once aligned with a language model, V-JEPA 2 excels at tasks such as answering questions about a video.
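To make that idea concrete, here is a deliberately simplified sketch of the joint-embedding predictive principle, written in PyTorch. The encoder, predictor, tensor shapes and loss below are illustrative assumptions, not Meta's actual architecture: the point is only that the prediction happens in embedding space rather than pixel space, which is why no annotations are needed.

```python
# Illustrative sketch only, NOT Meta's released code. The networks, shapes
# and names here are assumptions used to show the general JEPA idea:
# predict the *embedding* of the future video, not its pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256  # assumed embedding size for this sketch

class Encoder(nn.Module):
    """Maps a small (flattened) video clip to a compact embedding."""
    def __init__(self, in_dim=3 * 4 * 32 * 32, emb_dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, clip):
        return self.net(clip.flatten(1))

class Predictor(nn.Module):
    """Predicts the embedding of the future clip from the past clip's embedding."""
    def __init__(self, emb_dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 512), nn.GELU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, z_past):
        return self.net(z_past)

encoder, predictor = Encoder(), Predictor()
target_encoder = Encoder()                       # target branch, kept gradient-free
target_encoder.load_state_dict(encoder.state_dict())

past_clip = torch.randn(8, 3, 4, 32, 32)         # batch of "what happened"
future_clip = torch.randn(8, 3, 4, 32, 32)       # batch of "what happens next"

z_past = encoder(past_clip)
z_future_pred = predictor(z_past)
with torch.no_grad():                            # targets come from the frozen branch
    z_future = target_encoder(future_clip)

# The loss lives in embedding space: the model learns to anticipate how the
# scene evolves without reconstructing pixels and without any labels.
loss = F.mse_loss(z_future_pred, z_future)
loss.backward()
```

The real model is, of course, far larger and trained on the million hours of video mentioned above, but the training signal keeps the same shape: compare the predicted representation of the future with the representation actually computed from it.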
But it is above all in robotics that the model shows concrete results. After a second training phase using only 62 hours of data from robots in action, V-JEPA 2 is able to plan simple movements: grabbing an object, moving it, placing it in another location, even if that object or location was never seen during training.
One of the most interesting aspects is that the robot does not need to be trained in its final environment. Thanks to a standardized dataset, Meta can transfer the model directly to its own laboratory robots, without specific adaptation. The robot simply needs to observe the current scene and be given the visual goal to reach (for example, an image of the object placed in the desired location) in order to imagine scenarios and choose the most promising action.
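That planning loop can be sketched in a few lines. Everything below is a hypothetical illustration rather than Meta's implementation: the encoder and predictor are toy stand-ins, and names such as `plan`, `ACT_DIM` or `N_CANDIDATES` are assumptions. What it captures is the idea described above: encode the current scene and the goal image, imagine the outcome of many candidate action sequences, and keep the most promising one.

```python
# Illustrative "imagine-then-pick" planning loop, NOT Meta's code.
# The two linear layers stand in for a scene encoder and an
# action-conditioned predictor operating in embedding space.
import torch
import torch.nn as nn

EMB_DIM, ACT_DIM, N_CANDIDATES, HORIZON = 256, 7, 128, 5

encoder = nn.Linear(3 * 64 * 64, EMB_DIM)           # stand-in scene encoder
predictor = nn.Linear(EMB_DIM + ACT_DIM, EMB_DIM)   # stand-in action-conditioned predictor

def plan(current_image, goal_image):
    """Return the first action of the candidate sequence whose imagined
    outcome lands closest to the goal image's embedding."""
    with torch.no_grad():
        z0 = encoder(current_image.flatten())
        z_goal = encoder(goal_image.flatten())
        # Sample random candidate action sequences ("imagined scenarios").
        actions = torch.randn(N_CANDIDATES, HORIZON, ACT_DIM)
        z = z0.expand(N_CANDIDATES, -1)
        for t in range(HORIZON):
            z = predictor(torch.cat([z, actions[:, t]], dim=-1))
        # Score each scenario by how close its final embedding is to the goal.
        scores = -(z - z_goal).pow(2).sum(dim=-1)
        best = scores.argmax()
    return actions[best, 0]   # execute only the first action, then replan

current = torch.randn(3, 64, 64)   # current camera view
goal = torch.randn(3, 64, 64)      # image of the object in the desired spot
first_action = plan(current, goal)
```

Executing only the first action and then replanning from the new observation is a classic model-predictive control pattern; the article's "imagine scenarios and choose the most promising action" describes this kind of loop.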
Meta claims success rates between 65% and 80% on these pick-and-place tasks, even in unfamiliar environments. V-JEPA 2 is also said to be 30 times faster than Nvidia's Cosmos model, by Meta's own criteria.