In January 2020 Nature published “International evaluation of an AI system for breast cancer screening” by Scott Mayer McKinney and colleagues at Google Health and DeepMind. It was a high-profile claim that an AI system could improve on radiologists at reading screening mammograms, evaluated on large datasets from both the United Kingdom and the United States.
The study reported that, compared with the clinical decisions actually made, the AI system produced an absolute reduction of 5.7 percentage points (US) and 1.2 percentage points (UK) in false positives, and of 9.4 percentage points (US) and 2.7 percentage points (UK) in false negatives. In a separate reader study with six radiologists, the system’s area under the ROC curve exceeded that of the average radiologist by 11.5 percentage points. In a simulation of the UK’s double-reading process, the authors found that the system could cut the second reader’s workload by 88 percent while maintaining non-inferior performance.
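To make the word "absolute" concrete, here is a minimal sketch in Python of how such a reduction is computed, assuming the reported figures are read as differences between error rates. The counts, the `Counts` class, and the helper names are hypothetical and chosen only to keep the arithmetic easy to follow. They are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one reader on one screening dataset."""
    tp: int  # cancers correctly flagged
    fp: int  # cancer-free cases flagged unnecessarily
    tn: int  # cancer-free cases correctly cleared
    fn: int  # cancers missed

    def false_positive_rate(self) -> float:
        # Share of cancer-free cases that were flagged.
        return self.fp / (self.fp + self.tn)

    def false_negative_rate(self) -> float:
        # Share of cancers that were missed.
        return self.fn / (self.fn + self.tp)


def absolute_reduction_pp(baseline: float, candidate: float) -> float:
    """Difference between two rates, in percentage points."""
    return (baseline - candidate) * 100.0


def relative_reduction_pct(baseline: float, candidate: float) -> float:
    """The same difference, as a percentage of the baseline rate."""
    return (baseline - candidate) / baseline * 100.0


# Hypothetical counts for illustration only; NOT data from the study.
clinician = Counts(tp=80, fp=100, tn=900, fn=20)
ai_system = Counts(tp=85, fp=60, tn=940, fn=15)

fp_drop = absolute_reduction_pp(
    clinician.false_positive_rate(), ai_system.false_positive_rate()
)
fn_drop = absolute_reduction_pp(
    clinician.false_negative_rate(), ai_system.false_negative_rate()
)
fp_drop_relative = relative_reduction_pct(
    clinician.false_positive_rate(), ai_system.false_positive_rate()
)

print(f"False positives: {fp_drop:.1f} points lower ({fp_drop_relative:.0f}% relative)")
print(f"False negatives: {fn_drop:.1f} points lower")
```

With these made-up counts, the false positive rate falls from 10 percent to 6 percent: an absolute reduction of 4 percentage points, but a relative reduction of 40 percent. The two framings can sound very different, which is why it matters that the study's headline figures are absolute.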
The paper was influential but also became a focal point for debate about reproducibility in medical AI. A widely read commentary in Nature criticized the study for not releasing its code or model, arguing that without them the results could not be independently verified, a concern that shaped later expectations for transparency in clinical AI publications.
For a business or general reader, this milestone captures both the promise and the friction of medical AI: a credible result on a disease that affects millions, paired with a public argument over what evidence is needed before such a system can be trusted and deployed.