Vision-Language-Action Model (VLA)

A vision-language-action model, or VLA, is a single neural network that takes in what a robot sees (camera images) and what it is asked to do (a natural-language instruction), and outputs the actions the robot should take. The defining trick, introduced by Google DeepMind’s RT-2 in 2023, is to start from a vision-language model already trained on internet-scale image and text data, then fine-tune it to also produce robot actions. Frequently this is done by encoding actions as ordinary text tokens, so the same model that describes a scene can also command a motor.
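
To make the "actions as text tokens" idea concrete, here is a minimal sketch of one common scheme: each dimension of a continuous action vector is cut into uniform bins, and the bin indices are written out as a string the language model can emit. The bin count of 256, the normalized range of -1 to 1, and the function names encode_action and decode_action are illustrative assumptions, not the exact recipe of any particular model.

```python
import numpy as np

NUM_BINS = 256  # assumed number of bins per action dimension


def encode_action(action, low, high, num_bins=NUM_BINS):
    """Turn a continuous action vector into a space-separated token string."""
    action = np.clip(np.asarray(action, dtype=float), low, high)
    # Scale each dimension to [0, 1], then map it to a bin index.
    scaled = (action - low) / (high - low)
    bins = np.minimum((scaled * num_bins).astype(int), num_bins - 1)
    return " ".join(str(b) for b in bins)


def decode_action(text, low, high, num_bins=NUM_BINS):
    """Recover an approximate action vector from the token string."""
    bins = np.array([int(tok) for tok in text.split()], dtype=float)
    # Use the center of each bin as the decoded value.
    return low + (bins + 0.5) / num_bins * (high - low)


# Hypothetical 4-dimensional action, each dimension normalized to [-1, 1].
low = np.full(4, -1.0)
high = np.full(4, 1.0)
tokens = encode_action([0.0, -1.0, 1.0, 0.5], low, high)
recovered = decode_action(tokens, low, high)
```

In this sketch the round trip loses at most half a bin width per dimension, which is the price of treating motor commands as vocabulary items.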

The point of the VLA approach is transfer. A robot policy trained only on robot data has seen very little of the world; a VLA inherits broad knowledge from web pretraining, so it can recognize objects it never manipulated in training and follow instructions phrased in unfamiliar ways. This is why RT-2 could place an object on a specific icon it knew only from web data and had never encountered in robot training, and why VLAs tend to generalize better to novel objects and commands than older robot-specific policies.

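The category has grown quickly. RT-2 itself was never publicly released, but open models followed in 2024, including OpenVLA (built on a Llama 2 language backbone) and Octo (a generalist robot policy that, unlike OpenVLA, does not start from a pretrained vision-language model). Both were trained on data from the pooled multi-robot Open X-Embodiment dataset. Commercial robot foundation models have also appeared from companies like Physical Intelligence.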

Why business readers should care: VLAs are the bet that robots will follow the same path as chatbots, with one large pretrained model adapted to many tasks rather than bespoke programming per task. If that bet pays off, robot capability becomes a software and data problem more than a mechanical-engineering one.
