OpenVLA: An Open-Source Vision-Language-Action Model

OpenVLA was introduced by Moo Jin Kim and 17 co-authors, including Sergey Levine, Chelsea Finn, and Dorsa Sadigh, in a paper submitted to arXiv on June 13, 2024. Its motivation was openness: the most capable vision-language-action models, such as Google DeepMind’s RT-2 and RT-2-X, were closed, so the wider community could not study, fine-tune, or deploy them. OpenVLA was released as a fully open model, with checkpoints, PyTorch training code, and fine-tuning notebooks available under a permissive license.

The 7-billion-parameter model is built on the Llama 2 language backbone, with a vision encoder that fuses features from DINOv2 and SigLIP, and it was trained on 970,000 real-world robot demonstrations drawn from the Open X-Embodiment dataset. Despite having roughly seven times fewer parameters, OpenVLA outperformed the closed 55-billion-parameter RT-2-X by 16.5 percentage points in absolute task success across 29 tasks, and it beat Diffusion Policy, an imitation-learning method trained from scratch, by 20.4 points in multi-task settings.
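For readers who want a concrete picture of what "fusing" two vision encoders can mean, here is a minimal sketch in plain Python. It assumes the fusion is a per-patch concatenation of the two feature vectors followed by a projection into the language model's input space. The sizes, the random weights, and the helper names `fuse_patch_features` and `project` are hypothetical and chosen for illustration only; the single linear projection is a stand-in, not the released model's code.

```python
import random

# Illustrative sizes only; the real encoders emit far wider features.
NUM_PATCHES = 4   # hypothetical number of image patches
DINO_DIM = 3      # hypothetical DINOv2 feature width
SIGLIP_DIM = 2    # hypothetical SigLIP feature width
LLM_DIM = 6       # hypothetical language-model embedding width


def fuse_patch_features(dino_feats, siglip_feats):
    """Concatenate the two encoders' features patch by patch."""
    if len(dino_feats) != len(siglip_feats):
        raise ValueError("both encoders must describe the same patches")
    return [d + s for d, s in zip(dino_feats, siglip_feats)]


def project(features, weights):
    """Map each fused patch vector into the language model's input space.

    `weights` holds one weight vector per output dimension.
    """
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in weights]
        for vec in features
    ]


rng = random.Random(0)
# Stand-ins for the outputs of the two vision encoders.
dino = [[rng.random() for _ in range(DINO_DIM)] for _ in range(NUM_PATCHES)]
siglip = [[rng.random() for _ in range(SIGLIP_DIM)] for _ in range(NUM_PATCHES)]
# Randomly initialised projection weights (hypothetical, untrained).
proj_weights = [
    [rng.random() for _ in range(DINO_DIM + SIGLIP_DIM)] for _ in range(LLM_DIM)
]

fused = fuse_patch_features(dino, siglip)
visual_tokens = project(fused, proj_weights)

print(len(fused), len(fused[0]))                  # 4 5
print(len(visual_tokens), len(visual_tokens[0]))  # 4 6
```

Each image patch ends up as one vector that carries both encoders' views of it, and the projected vectors are what the language backbone would consume alongside the text instruction.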

Why business readers should care: OpenVLA showed that a state-of-the-art robot foundation model could be open and run on modest hardware, much as Llama did for language models. The authors reported that it could be fine-tuned on consumer GPUs using low-rank adaptation and served efficiently with quantization. This lowered the barrier for companies and labs to build on robot foundation models rather than depend on a single vendor’s closed system, and OpenVLA became a common starting point for fine-tuning robot policies.

