GAIA, short for “a benchmark for General AI Assistants,” was posted to arXiv on November 21, 2023 by Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom, from Meta AI, Hugging Face, and AutoGPT. It set out to measure how well an AI system can act as a general assistant on tasks that are conceptually simple for people but require chaining several distinct abilities.
The benchmark contains 466 questions, each requiring some mix of reasoning, handling multiple modalities, browsing the web, and using tools to reach a single correct answer. The questions are deliberately easy to state and hard to automate: the kind of thing a capable human handles routinely. The headline result was a stark gap: human respondents answered 92 percent correctly, while GPT-4 equipped with plugins managed only 15 percent. The authors highlighted that this inverts the usual story of models beating humans on tasks requiring professional skills in fields such as law or chemistry, and that it exposed how far agents were from everyday competence. GAIA also addresses an evaluation problem: its answers are factual and unambiguous, so scoring does not depend on a judge model.
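Because each question has one short factual answer, a response can in principle be checked by direct comparison with the reference. The following is a minimal sketch of that idea, not the benchmark's official scorer: the function names (`is_correct`, `accuracy`) and the normalization rules (ignoring case, punctuation, and number formatting) are assumptions chosen for illustration.

```python
import math
import re
import string


def _as_number(text: str):
    """Return text parsed as a finite float, or None if it is not numeric."""
    cleaned = text.strip().replace(",", "").replace("$", "").replace("%", "")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def _normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(prediction: str, reference: str) -> bool:
    """Hypothetical scorer: compare as numbers when the reference is numeric,
    otherwise compare normalized strings. No judge model is involved."""
    ref_num = _as_number(reference)
    if ref_num is not None:
        pred_num = _as_number(prediction)
        return pred_num is not None and pred_num == ref_num
    return _normalize(prediction) == _normalize(reference)


def accuracy(predictions, references) -> float:
    """Fraction of predictions that match their reference answers."""
    pairs = list(zip(predictions, references))
    return sum(is_correct(p, r) for p, r in pairs) / len(pairs)
```

Under these assumed rules, `is_correct("1,000", "1000")` and `is_correct(" Paris. ", "paris")` both count as matches, while any differently worded answer does not; that determinism is what lets scoring run without a judge model.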
GAIA became one of the most-watched yardsticks for general-purpose agents, and results on it were a key claim for later systems such as Microsoft’s Magentic-One. Its design, built on tasks that are simple for humans but still require real tool use and web navigation, made it a durable test of whether agents were closing the gap to practical assistance.