SayCan, formally titled “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances,” is a paper by Michael Ahn, Anthony Brohan, Noah Brown, and more than 40 co-authors at Google, submitted to arXiv on April 4, 2022. It addressed a mismatch at the heart of using large language models for robots: an LLM can describe the steps that a request like “I spilled my drink, can you help?” calls for, but it has no knowledge of what the robot in front of it can physically do right now.
The method combines two signals for each candidate skill. The language model scores how useful a skill is for completing the instruction (the “say” part), while a learned value function for each low-level skill scores how likely the robot is to succeed at that skill from its current state (the “can” part). Multiplying the two and picking the highest-scoring skill selects a next step that is both relevant to the request and feasible for this robot in this environment; repeating the selection after each executed skill chains those steps into a plan. The robot serves as the language model’s hands and eyes; the LLM supplies the procedural knowledge.
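The scoring rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_skill`, the skill strings, and all numeric scores below are hypothetical values chosen for illustration, standing in for the language model's usefulness scores and the value functions' success estimates.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Return the skill with the highest product of the two scores.

    llm_scores:  hypothetical "say" scores, how useful the language model
                 rates each skill for the instruction.
    affordances: hypothetical "can" scores, how likely each skill is to
                 succeed from the robot's current state.
    """
    combined = {
        skill: llm_scores[skill] * affordances[skill]
        for skill in llm_scores
    }
    return max(combined, key=combined.get)


# Made-up numbers for the instruction "I spilled my drink, can you help?"
llm_scores = {
    "find a sponge": 0.5,
    "pick up the sponge": 0.4,
    "go to the table": 0.1,
}
# No sponge is in view yet, so picking one up is unlikely to succeed.
affordances = {
    "find a sponge": 0.9,
    "pick up the sponge": 0.1,
    "go to the table": 0.8,
}

best = select_skill(llm_scores, affordances)
print(best)  # find a sponge
```

In this toy case the language model rates picking up the sponge almost as highly as finding it, but the low success estimate pushes that skill down, so the combined score favors the step the robot can actually carry out first.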
SayCan was evaluated on a mobile manipulator in a real office kitchen, using long-horizon, abstract natural-language instructions that required chaining multiple skills to satisfy each request. A later revision added results using the much larger PaLM model, plus studies of drawer manipulation and multilingual instructions. SayCan was an early and influential demonstration that LLMs could act as high-level planners for embodied agents, a pattern that fed directly into PaLM-E and the vision-language-action models that followed.