RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

RT-2 was introduced by Anthony Brohan, Noah Brown, and more than 50 co-authors at Google DeepMind in a paper submitted to arXiv on July 28, 2023. Its central idea is deceptively simple: take a vision-language model already trained on internet-scale image and text data, and fine-tune it so that it also outputs robot actions, represented as ordinary text tokens identical in form to the words it already produces. Because actions are just another kind of token, the model can be co-fine-tuned on robot trajectories and on standard vision-language tasks such as visual question answering at the same time.
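The idea of treating actions as text can be illustrated with a minimal sketch. It assumes each continuous action dimension is clipped to a fixed range, discretized into uniform bins, and written out as a space-separated string of integers that a language model could emit like any other text. The bin count (256), the normalized range [-1, 1], and the function names `encode_action` and `decode_action` are assumptions made for illustration; none of them comes from the text above, and this is not the authors' implementation.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of uniform bins per action dimension


def encode_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Turn a continuous action vector into a space-separated string of bin indices."""
    tokens = []
    for value in action:
        # Clip to the assumed valid range so out-of-range values map to the edge bins.
        clipped = min(max(value, low), high)
        # Scale to [0, NUM_BINS - 1] and round to the nearest bin index.
        bin_index = round((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Invert encode_action: map each integer token back to a value in [low, high]."""
    return [low + int(token) / (NUM_BINS - 1) * (high - low) for token in text.split()]


if __name__ == "__main__":
    # Hypothetical 7-dimensional action: 3 translation, 3 rotation, 1 gripper value.
    example = [0.10, -0.25, 0.00, 0.50, -1.00, 1.00, 0.75]
    as_text = encode_action(example)
    print(as_text)                 # the string a model would be trained to emit
    print(decode_action(as_text))  # approximate recovery of the original action
```

Decoding recovers each value only to within half a bin width, which is the precision cost of this kind of discretization. In exchange, the action string can share the output vocabulary and training objective of ordinary text, which is what makes joint training on robot and vision-language data straightforward.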

This is the paper that defined the vision-language-action model (VLA) category. By keeping the web-scale knowledge in the model rather than training a robot policy from scratch, RT-2 inherited semantic understanding that was absent from its robot data. In roughly 6,000 evaluation trials it generalized to novel objects, interpreted commands referencing concepts it had encountered only in web data (such as placing an object on a specific number or icon), and performed rudimentary reasoning such as picking the largest or smallest object. With chain-of-thought prompting it could carry out multi-step semantic reasoning before acting.

RT-2 built on the PaLM-E and PaLI-X vision-language backbones and on the data collection pipeline behind RT-1. It demonstrated that the path to general robot competence might run through the same large pretrained models driving progress in language and vision, rather than through robotics-specific engineering. That thesis shaped Open X-Embodiment, OpenVLA, and the commercial robot foundation models that followed.
