On October 31, 2024, the startup Physical Intelligence announced pi0 (pi-zero), which it described as a general-purpose robot foundation model. pi0 is a vision-language-action (VLA) model: it builds on a pre-trained 3-billion-parameter vision-language model and adds a flow-matching action head, a variant of diffusion, that takes camera images and a text instruction as input and generates continuous motor commands up to 50 times per second. That high-frequency continuous output is what lets it perform dexterous, fluid manipulation rather than coarse pick-and-place.
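To make the flow-matching idea concrete, the sketch below shows the general sampling recipe: start from Gaussian noise and integrate a velocity field over a few Euler steps until the sample becomes an action vector. It is an illustration under stated assumptions, not pi0's implementation. `toy_velocity`, `sample_action`, the three-dimensional `target`, and the step count of 10 are all hypothetical. In a real model the velocity would come from a learned network conditioned on images and the text instruction, and the target would not be known in advance.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    Returns the velocity of the straight-line path that carries the current
    sample x (at time t in [0, 1)) to `target` at t = 1. A real model would
    predict this from camera images and a text instruction instead.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Generate one action vector by Euler-integrating the velocity field."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per action dimension.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps
        v = toy_velocity(x, t, target)
        # Euler step: move the sample a small distance along the velocity.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Illustrative 3-dimensional "motor command" (values are made up).
target = [0.1, -0.4, 0.25]
action = sample_action(target)
```

Because the toy field points straight at `target`, the integration lands on it after the final step. The point of the sketch is only the structure of the loop: a handful of cheap integration steps turns noise into continuous values, which is what makes this kind of action head suitable for high-rate control.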
The model was trained across eight robot platforms - including single and bimanual UR5e arms, a Franka arm, bimanual Trossen and ARX setups, and mobile bases - using internet-scale vision-language pretraining, open robot datasets such as Open X-Embodiment, and proprietary dexterous-task data collected by the company. Its showcase tasks were deliberately hard and human-like: folding shirts and shorts pulled from a hamper, bussing a table of mixed objects, bagging groceries, assembling a cardboard box, and folding towels.
pi0 marked the arrival of robot foundation models as a venture-backed commercial category rather than purely an academic pursuit. Founded by researchers including Sergey Levine and Chelsea Finn - the same names behind much of the academic VLA and imitation-learning work - Physical Intelligence positioned pi0 as a step toward a single model that can control many robots across many tasks.