Octo was released by the Octo Model Team (18 collaborators, including Dibya Ghosh, Homer Walke, and Sergey Levine) in a paper submitted to arXiv on May 20, 2024. It is a transformer-based generalist robot policy meant to serve as an open, reusable starting point for manipulation, in the same spirit as a pretrained vision or language backbone. A task can be specified with either a language command or a goal image, and the model outputs continuous actions through a diffusion-based head.
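To give a rough sense of what a diffusion-based action head does, the sketch below runs a generic DDPM-style reverse process: it starts from Gaussian noise and denoises it step by step into a continuous action vector. It is not Octo's implementation. The function names, the linear noise schedule, the 20-step default, and the closed-form `toy_noise_predictor` (which stands in for a learned network conditioned on observations and the task) are all assumptions made for illustration.

```python
import math
import random


def linear_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alpha_bars) for a linear DDPM noise schedule."""
    betas = [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_noise_predictor(noisy_action, alpha_bar, target_action):
    """Hypothetical stand-in for a learned denoising network.

    A real policy would predict the noise from an embedding of the
    observations and task. Here the "conditioning" is simply the clean
    target action, so the noise can be recovered in closed form.
    """
    scale = math.sqrt(1.0 - alpha_bar)
    return [
        (x - math.sqrt(alpha_bar) * a) / scale
        for x, a in zip(noisy_action, target_action)
    ]


def sample_action(target_action, num_steps=20, seed=0):
    """Run the DDPM reverse process from Gaussian noise to an action."""
    rng = random.Random(seed)
    betas, alpha_bars = linear_schedule(num_steps)
    # Start from pure noise with the same dimensionality as the action.
    action = [rng.gauss(0.0, 1.0) for _ in target_action]
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(action, alpha_bars[t], target_action)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Posterior mean of the less-noisy action given the predicted noise.
        mean = [
            (x - coef * e) / math.sqrt(1.0 - betas[t])
            for x, e in zip(action, eps)
        ]
        # Add fresh noise at every step except the last one.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        action = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return action
```

Calling `sample_action([0.1, -0.2, 0.3, 0.0, 0.5, -0.5, 1.0])` returns a 7-dimensional vector that matches the target up to floating-point error, because the toy predictor is exact. A learned head would instead produce samples from a distribution over plausible actions. That is the usual motivation for a diffusion head: it can represent continuous, multimodal action distributions.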
Octo was trained on 800,000 robot trajectories from the Open X-Embodiment dataset, at the time the largest robot manipulation dataset available. Its design emphasizes flexibility: the policy can be fine-tuned to robots with new cameras, sensors, and action spaces within a few hours on standard consumer GPUs, and the team validated it across nine robotic platforms as an effective initialization for downstream learning.
Octo and OpenVLA, released within weeks of each other in 2024, established an open-source counterweight to closed robot foundation models like RT-2. Where OpenVLA leaned on a large vision-language backbone, Octo prioritized a lightweight, easily fine-tuned policy, giving labs and companies a practical generalist they could adapt to their own robots without first collecting a huge dataset.