RT-1: Robotics Transformer for Real-World Control at Scale

RT-1 (Robotics Transformer 1) was introduced by Anthony Brohan and more than 50 co-authors at Google in a paper submitted to arXiv on December 13, 2022. It applied the recipe that had worked in language and vision - a high-capacity transformer trained on a large, diverse dataset - to the problem of real-world robot control. The model takes a short sequence of camera images plus a natural-language task description and, at each time step, outputs the robot's next action as discrete tokens, one per action dimension.
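To make the discretized action output concrete, here is a minimal sketch of how a continuous robot command can be turned into integer tokens and back. It uses uniform binning over a fixed range per dimension. The bin count of 256, the three-dimensional example action, and the [-1, 1] ranges are assumptions chosen for illustration; the function names are hypothetical and this is not the authors' code.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of bins per action dimension


def discretize(action: Sequence[float], low: Sequence[float],
               high: Sequence[float], num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action dimension to an integer bin index (a token)."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)        # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)    # rescale to [0, 1]
        # The top edge of the range would land in bin num_bins, so clamp it.
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def undiscretize(tokens: Sequence[int], low: Sequence[float],
                 high: Sequence[float], num_bins: int = NUM_BINS) -> List[float]:
    """Recover the bin-center value for each token."""
    return [lo + (t + 0.5) / num_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 3-dimensional action, each dimension ranging over [-1, 1].
low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
tokens = discretize([-1.0, 0.0, 1.0], low, high)
print(tokens)  # [0, 128, 255]
```

Treating each dimension as a token turns control into a classification problem over bins, which is the same kind of output a language model produces. The cost is a quantization error of at most half a bin width per dimension.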

The training data was the headline contribution. The team collected over 130,000 episodes spanning more than 700 distinct tasks, gathered with a fleet of 13 robots over 17 months in real kitchens and offices. Skills included picking, placing, opening and closing drawers, moving objects between locations, and opening jars. The architecture combined an EfficientNet image encoder conditioned on the task embedding, a TokenLearner module to compress the visual tokens, and a transformer that produced the action tokens.
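The token-compression step can be sketched as soft pooling: each output token is a learned weighted average of all the input tokens. The code below is a simplified stand-in for that idea, not the published TokenLearner layer and not the authors' implementation. The token counts (81 in, 8 out), the feature width of 16, and the random "selector" vectors are assumptions for illustration; in a trained model the selectors would be learned.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(tokens: List[Vector], selectors: List[Vector]) -> List[Vector]:
    """Pool len(tokens) input tokens into len(selectors) output tokens."""
    dim = len(tokens[0])
    pooled = []
    for selector in selectors:
        # Score every input token against this selector, then normalize.
        scores = [sum(s * x for s, x in zip(selector, tok)) for tok in tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of all input tokens.
        pooled.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                       for d in range(dim)])
    return pooled


rng = random.Random(0)
# Assumed sizes: 81 visual tokens per image, compressed to 8.
image_tokens = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(81)]
selectors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
compact = compress_tokens(image_tokens, selectors)
print(len(image_tokens), "->", len(compact))  # 81 -> 8
```

Because self-attention cost grows with the square of the sequence length, shrinking the visual tokens before the transformer is what keeps inference fast enough for on-robot control.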

On tasks it had seen in training, RT-1 reached a 97 percent success rate, beating the prior BC-Z and Gato baselines. More importantly, it generalized: it successfully executed 76 percent of never-before-seen instructions, 24 percentage points better than the next best baseline, and stayed robust to new backgrounds and distractor objects. RT-1 became the foundation for a line of work - RT-2, RT-X, and the broader vision-language-action model family - that treats robot control as a large-scale learning problem rather than a hand-engineered one.

Last verified June 7, 2026