“Gender Shades,” by Joy Buolamwini and Timnit Gebru, was published in 2018 at the first Conference on Fairness, Accountability, and Transparency (then FAT*, now ACM FAccT). The authors audited three commercial gender-classification systems (from Microsoft, IBM, and Face++), testing how accurately each labeled a person as male or female across skin type and gender. Crucially, they evaluated the systems intersectionally, breaking results down not by gender or skin type alone but by the combination of the two, which is where the worst failures hid.
The headline result was a stark gap. The systems were nearly perfect on lighter-skinned men, with a maximum error rate of 0.8 percent, but misclassified darker-skinned women at rates as high as 34.7 percent. To measure this, the authors built a new benchmark dataset, the Pilot Parliaments Benchmark, balanced by gender and skin type. Existing face datasets were overwhelmingly lighter-skinned, which had let vendors report high overall accuracy while masking who the systems failed.
The paper became one of the most influential pieces of evidence that commercial AI vision products carry measurable demographic bias. It helped shape both later corporate decisions to pause or exit facial-recognition sales and government scrutiny of the technology. Buolamwini’s continued advocacy, including through the Algorithmic Justice League, is closely tied to this work.
Why a business reader should care: a model can post an impressive aggregate accuracy number and still fail badly for specific groups of customers or citizens. Testing performance broken down by subgroup, not just in aggregate, is the difference between a product that looks fair and one that is.
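As a rough illustration of that kind of subgroup testing, the sketch below computes an error rate overall and then for each skin-type and gender combination. It is a minimal example and not the authors' evaluation code: the record fields (`skin`, `actual`, `predicted`), both helper functions, and all of the counts are hypothetical and chosen only to show the effect.

```python
from collections import defaultdict


def error_rates_by_subgroup(records, keys):
    """Return {subgroup_tuple: error_rate}, grouping records by the given keys."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in records:
        group = tuple(r[k] for k in keys)  # empty keys -> one group, ()
        totals[group] += 1
        if r["predicted"] != r["actual"]:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}


def make_records(skin, gender, n, n_wrong):
    """Build n hypothetical records for one subgroup, n_wrong of them misclassified."""
    other = "male" if gender == "female" else "female"
    return [
        {"skin": skin, "actual": gender,
         "predicted": other if i < n_wrong else gender}
        for i in range(n)
    ]


# Hypothetical, illustrative counts only (NOT figures from the paper).
# The test set is deliberately imbalanced toward lighter-skinned men.
records = (
    make_records("lighter", "male", 400, 2)
    + make_records("lighter", "female", 300, 15)
    + make_records("darker", "male", 200, 14)
    + make_records("darker", "female", 100, 30)
)

# Aggregate view: one number for the whole test set.
overall = error_rates_by_subgroup(records, keys=[])[()]

# Intersectional view: one number per (skin type, gender) combination.
by_group = error_rates_by_subgroup(records, keys=["skin", "actual"])

print(f"overall error: {overall:.1%}")
for group, rate in sorted(by_group.items(), key=lambda kv: kv[1]):
    print(f"{group}: {rate:.1%}")
```

With these made-up counts the overall error rate is 6.1 percent, while the smallest subgroup sees a 30 percent error rate. The aggregate figure looks acceptable precisely because the group that fares worst makes up only a small share of the test set.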