Datasheets for Datasets

“Datasheets for Datasets” was posted to arXiv on March 23, 2018 by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. The proposal borrows directly from electronics: every physical component, the authors note, ships with a datasheet describing its characteristics, test results, and recommended uses. Machine learning datasets, by contrast, typically arrive with no standard documentation at all, which the paper argues can cause serious harm in high-stakes settings.

The remedy they propose is a datasheet that travels with each dataset and answers a fixed set of questions about its motivation (why it was created, by whom, and with what funding), its composition (what the instances are, how they were sampled, what labels exist, what is missing), its collection process, any preprocessing, and its recommended and discouraged uses. The aim is to surface assumptions and limitations before a dataset gets reused in a context its creators never intended.
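For readers who want to see what such a record might look like in practice, here is a minimal sketch in Python. It is an illustration only, not the paper's own schema: the `Datasheet` class, its field names, the `unanswered_sections` helper, and the sample answers are all hypothetical, chosen to mirror the question areas listed above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Datasheet:
    """Hypothetical minimal datasheet; field names are illustrative only."""

    motivation: str = ""          # why it was created, by whom, with what funding
    composition: str = ""         # what the instances are, sampling, labels, gaps
    collection_process: str = ""  # how the data was gathered
    preprocessing: str = ""       # any cleaning or transformation applied
    recommended_uses: str = ""    # tasks the dataset is suited to
    discouraged_uses: str = ""    # tasks the dataset should not be used for


def unanswered_sections(sheet: Datasheet) -> List[str]:
    """Return the names of sections that were left blank."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# Hypothetical example content, invented for illustration.
sheet = Datasheet(
    motivation="Built to benchmark sentiment classifiers.",
    composition="10,000 English product reviews with star ratings.",
)
missing = unanswered_sections(sheet)
print(missing)
```

A check like this could flag incomplete documentation before a dataset is shared, though in real use the answers would be far longer than a single string per section.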

Datasheets for datasets, alongside model cards, helped start a documentation movement now embedded in responsible-AI practice and in regulation. For a business reader, the underlying idea is practical risk management: knowing where data came from and what it is fit for is the difference between deploying a model responsibly and inheriting liabilities you cannot see.
