BigScience

BigScience was a value-driven, open research initiative that ran for about a year and a half and brought together hundreds of researchers from around the world to study and build large language models in the open. The collaboration is documented in its own retrospective paper, “BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model” (arXiv 2212.04960, December 9, 2022), by Christopher Akiki, Giada Pistilli, Thomas Wolf, Yacine Jernite, and colleagues, and in the BLOOM model paper it produced.

Its two best-known outputs are the ROOTS corpus, a 1.6-terabyte multilingual dataset covering 59 languages (46 natural languages and 13 programming languages), and BLOOM, the 176-billion-parameter open multilingual model trained on it. The case-study paper stresses, however, that BigScience was as much a social experiment as a technical one: it deliberately foregrounded ethics, law, data governance, and the logistics of distributed collaboration, and the authors set out to share what they got right and what they would do differently in such large-scale participatory research.

BigScience showed that frontier-scale AI research did not have to happen only behind the walls of a few well-funded labs, and that an open, multidisciplinary community could ship both a major model and a substantial body of governance and ethics work.

For organizations, BigScience is a reference point for how open, collaborative AI development can be organized and for what it takes to do so responsibly across many institutions and countries.
