Foundation Models Meet The Real World

A bi-arm robot sorting recycling

In our last post, we explored how AI is entering the physical world. What we called situational computing—systems grounded in physical context—is starting to emerge across a variety of domains, from immersive communication to embodied interaction.

This month, the theme continues—but with a more mechanical twist. Meta and Google have released major updates that push robotics forward in two complementary directions: one focused on perception and planning, the other on dexterity and deployment. Together, they hint at a new phase for physical AI—one where robots see, understand, manipulate, and act.

From Video to World Models

Meta’s V-JEPA 2 is a “world model”—a term borrowed from cognitive science, referring to an internal simulation of how the world works. The idea is straightforward: if an AI system can predict the consequences of actions in its environment, it can make better decisions, even in unfamiliar settings. What makes V-JEPA 2 notable is the scale, the modality, and the downstream utility.

Trained on over a million hours of video, the model can reason about motion, predict future events, and guide robotic actions without needing fine-tuning data from each deployment environment. In short, it anticipates what’s likely to happen next.

Meta demonstrates how V-JEPA 2 can be used for zero-shot robot planning. A robot sees where it is, compares that to a visual goal, and uses the model to evaluate candidate actions—choosing the one most likely to succeed. Over time, it re-plans and adjusts. The robot doesn’t need to memorize specific scenarios; it’s equipped with a sense of physical cause and effect.
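To make that loop concrete, here is a minimal, self-contained sketch of the idea in Python. It is an illustration, not Meta’s actual API: the ToyWorldModel, its encode/predict interface, and the latent-distance cost are placeholder assumptions standing in for what a model like V-JEPA 2 would provide.

```python
import numpy as np

rng = np.random.default_rng(0)


class ToyWorldModel:
    """Stand-in for a learned world model (hypothetical interface).

    encode() maps an observation to a latent state; predict() rolls that
    state forward under a candidate action. Both are random linear maps
    here, purely so the sketch runs end to end."""

    def __init__(self, obs_dim=64, latent_dim=16, action_dim=4):
        self.enc = rng.normal(size=(obs_dim, latent_dim))
        self.dyn = rng.normal(size=(latent_dim + action_dim, latent_dim))

    def encode(self, obs):
        return obs @ self.enc

    def predict(self, state, action):
        return np.concatenate([state, action]) @ self.dyn


def plan_step(model, current_obs, goal_obs, candidate_actions):
    """Score each candidate by how close its predicted outcome lands to the
    goal in latent space, and return the best one."""
    state, goal = model.encode(current_obs), model.encode(goal_obs)
    costs = [np.linalg.norm(model.predict(state, a) - goal) for a in candidate_actions]
    return candidate_actions[int(np.argmin(costs))]


# Receding-horizon loop: pick an action, execute it, observe, re-plan.
# (Robot I/O is omitted; observations here are just random vectors.)
model = ToyWorldModel()
current, goal = rng.normal(size=64), rng.normal(size=64)
candidates = [rng.normal(size=4) for _ in range(32)]
print("chosen action:", plan_step(model, current, goal, candidates))
```

A real planner in this style typically samples whole action sequences, rolls them out several steps ahead, and keeps re-optimizing as new observations arrive; the single-step version above only shows the core score-and-select pattern.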

It’s a powerful step forward in robot autonomy. Instead of being pre-programmed for specific situations, the robot is given an internal model it can use to reason about new ones.

Hands, Not Just Eyes

If V-JEPA 2 gives robots the ability to imagine, Google’s Gemini Robotics On-Device gives them the ability to act.

Released just days ago, Gemini Robotics On-Device is a vision-language-action (VLA) model optimized to run locally on the robot itself. That’s a meaningful shift: it reduces latency, removes the need for constant connectivity, and, more importantly, enables practical deployment in real-world environments where cloud access may be unreliable or impossible.

This model is aimed at two-armed robotics, demonstrating general-purpose dexterity across a wide range of tasks, from folding dresses to unzipping bags. Developers can fine-tune it with as few as 50–100 examples to adapt to novel tasks. Importantly, the model isn’t confined to a single robot design. Google shows it working across multiple embodiments, including both the ALOHA platform and humanoid systems like Apptronik’s Apollo.
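As a rough illustration of what adapting from 50–100 examples can look like in practice, here is a generic few-shot behavior-cloning sketch in PyTorch. It is not the Gemini Robotics SDK: the tensor shapes, adapter head, and training loop are placeholder assumptions for the general pattern of fitting a small head on top of a frozen pretrained policy using a handful of demonstrations.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for ~100 recorded demonstrations: an observation embedding
# paired with the action the demonstrator took (all dimensions are made up).
observations = torch.randn(100, 512)   # e.g. fused vision-language features
actions = torch.randn(100, 14)         # e.g. joint targets for two 7-DoF arms
demos = DataLoader(TensorDataset(observations, actions), batch_size=16, shuffle=True)

# Hypothetical lightweight adapter trained on top of a frozen pretrained policy.
adapter = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 14))
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for epoch in range(20):
    for obs_batch, act_batch in demos:
        loss = nn.functional.mse_loss(adapter(obs_batch), act_batch)  # behavior cloning
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

With so few demonstrations, the heavy lifting is done by the pretrained model; the new data only nudges a small set of parameters toward the new task.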

It’s not hard to imagine what this unlocks: multi-purpose robots that adapt on the fly, learn new behaviors quickly, and operate with full-stack autonomy, covering everything from perception and planning to physical execution without relying on remote servers or extensive task-specific retraining.

A New Robotics Stack

Taken together, V-JEPA 2 and Gemini Robotics On-Device suggest that a new robotics stack is emerging: one that is modular, scalable, and increasingly general.

Meta is focused on building a foundation for physical reasoning—training world models that allow agents to simulate, plan, and adapt. These models don’t care what body the agent inhabits; they’re about understanding the dynamics of the world itself.

Google, on the other hand, is pushing the execution layer forward, designing models that can live on the robot, operate with low latency, and manipulate a wide range of physical objects with surprising dexterity.

The former is about foresight. The latter is about follow-through.

The Physical Frontier

This isn’t the return of clunky industrial automation. These systems aren’t coded step-by-step or trained on one repetitive motion. They’re learning from the world, adapting to it, and beginning to reason in ways that mirror how we interact with the physical environment: observing, simulating, trying, adjusting.

For AI to move beyond screens and prompts, it needs to inhabit space. And for robotics to move beyond the lab, it needs models that are compact, general, and trainable in real-world conditions. That’s where physical AI lives.

We’re still early. But with models like V-JEPA 2 and Gemini Robotics On-Device, the future of robotics is starting to look a lot more like software.