PaLM-E was introduced by Danny Driess and 21 co-authors from Google and TU Berlin in a paper submitted to arXiv on March 6, 2023. It is an embodied multimodal language model: instead of taking only text, it consumes “multimodal sentences” that interleave words with encoded images and continuous robot state estimates. Each sensor input is encoded and projected into the same embedding space as the language model's word tokens, so a single model can reason jointly over what it reads, what it sees, and what its body is doing.
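The idea of a multimodal sentence can be illustrated with a minimal sketch. This is not the PaLM-E implementation: the embedding width, the random stand-in weights, the feature sizes, and every function name below are assumptions chosen for illustration. It only shows the general pattern of mapping each modality to vectors of the same width as the word embeddings and concatenating them in reading order.

```python
import random

EMBED_DIM = 8  # hypothetical embedding width; real models use thousands


def make_matrix(rows, cols, rng):
    """Random stand-in for a learned projection matrix (rows x cols)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(features, matrix):
    """Linear projection: a feature vector of length n times an n x d matrix."""
    width = len(matrix[0])
    return [sum(f * row[j] for f, row in zip(features, matrix)) for j in range(width)]


def build_multimodal_sentence(segments, word_table, projectors):
    """Turn interleaved (kind, payload) segments into one embedding sequence.

    'text' payloads are lists of words looked up in word_table; any other
    kind is a list of feature vectors passed through that kind's projector.
    """
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(word_table[word] for word in payload)
        else:
            sequence.extend(project(vec, projectors[kind]) for vec in payload)
    return sequence


rng = random.Random(0)
vocab = ["given", "and", "state", "pick", "up", "the", "block"]
word_table = {w: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for w in vocab}
projectors = {
    "image": make_matrix(16, EMBED_DIM, rng),  # assumed 16-d patch features
    "state": make_matrix(3, EMBED_DIM, rng),   # assumed 3-d robot state
}

image_patches = [[rng.random() for _ in range(16)] for _ in range(4)]
robot_state = [[0.2, -0.5, 1.0]]

segments = [
    ("text", ["given"]),
    ("image", image_patches),
    ("text", ["and", "state"]),
    ("state", robot_state),
    ("text", ["pick", "up", "the", "block"]),
]
sentence = build_multimodal_sentence(segments, word_table, projectors)
print(len(sentence), len(sentence[0]))  # 12 vectors, each of width 8
```

In this sketch the language model would receive the 12 vectors as an ordinary input sequence; nothing downstream needs to know which positions came from words and which came from sensors.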
The largest variant, PaLM-E-562B, had 562 billion parameters: it combined the 540-billion-parameter PaLM language model with a 22-billion-parameter vision transformer (ViT-22B). Trained jointly across robot manipulation planning, visual question answering, and image captioning, it exhibited positive transfer: joint training on these diverse tasks improved performance on individual tasks compared with training on each task alone. PaLM-E-562B set a state-of-the-art result on the OK-VQA visual question answering benchmark while retaining its general language abilities.
For robotics, PaLM-E acted as a high-level planner that could turn a natural-language goal and a camera image into a sequence of steps, building on the planning idea SayCan introduced but folding perception directly into the model. It was one of the vision-language backbones, alongside PaLI-X, that RT-2 then fine-tuned to emit robot actions, making PaLM-E a direct ancestor of the vision-language-action model line.
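The high-level planning role described above can be sketched as a simple closed loop: observe, ask the model for the next step in words, hand that step to a low-level skill, and repeat. Everything in this sketch is hypothetical. The function names, the dictionary standing in for a camera image, and the scripted planner standing in for the model are illustrative assumptions, not PaLM-E's actual interface.

```python
def plan_and_execute(goal, observe, propose_next_step, execute_skill, max_steps=10):
    """Replan from a fresh observation before every step until 'done'."""
    history = []
    for _ in range(max_steps):
        image = observe()
        step = propose_next_step(goal, image, history)
        if step == "done":
            break
        execute_skill(step)
        history.append(step)
    return history


# Toy world state; a real system would use camera images and a robot.
world = {"drawer_open": False, "holding_chips": False, "chips_delivered": False}


def observe():
    """Stand-in for a camera image: a snapshot of the toy world."""
    return dict(world)


def scripted_planner(goal, image, history):
    """Stand-in for the model: picks the next step from the observation."""
    if not image["drawer_open"]:
        return "open the drawer"
    if not image["holding_chips"] and not image["chips_delivered"]:
        return "pick up the chips"
    if not image["chips_delivered"]:
        return "bring the chips to the user"
    return "done"


def execute_skill(step):
    """Stand-in for a low-level policy that carries out one step."""
    if step == "open the drawer":
        world["drawer_open"] = True
    elif step == "pick up the chips":
        world["holding_chips"] = True
    elif step == "bring the chips to the user":
        world["holding_chips"] = False
        world["chips_delivered"] = True


steps = plan_and_execute(
    "bring me the chips from the drawer", observe, scripted_planner, execute_skill
)
print(steps)
# ['open the drawer', 'pick up the chips', 'bring the chips to the user']
```

Because the planner is consulted again after every step, a loop of this shape can react if the scene changes between steps, which is the practical benefit of folding perception into the planner.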