The reproducibility crisis in machine-learning science

As machine learning spread into the sciences, a recurring problem emerged: impressive reported results often could not be reproduced, and many were inflated by methodological errors rather than genuine predictive power. A leading culprit is data leakage, where information from the test set, or information that would not be available at prediction time, contaminates the training process and produces over-optimistic accuracy estimates that collapse in real use.
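To make the mechanism concrete, here is a minimal sketch in plain Python. It is an illustration built on assumptions, not an example taken from any published study: the data are synthetic noise, and the "model" is a deliberately naive nearest-neighbor lookup. The helper names (nearest_label, accuracy, make_rows) are hypothetical. Because the labels are random, there is nothing to learn, so any accuracy well above chance can only come from leakage.

```python
import random

def nearest_label(train, x):
    """Predict by copying the label of the closest training row (1-nearest-neighbor)."""
    closest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return closest[1]

def accuracy(train, test):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(nearest_label(train, x) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_rows(n, n_features=5):
    # Features are random noise and labels are coin flips: no real signal exists.
    return [([rng.random() for _ in range(n_features)], rng.randint(0, 1))
            for _ in range(n)]

train = make_rows(200)
unseen_test = make_rows(100)         # rows the model has never seen
leaky_test = rng.sample(train, 100)  # rows that also sit in the training set

leaky_acc = accuracy(train, leaky_test)
honest_acc = accuracy(train, unseen_test)
print(f"leaky test accuracy:  {leaky_acc:.2f}")
print(f"honest test accuracy: {honest_acc:.2f}")
```

The leaky evaluation scores 1.00 by construction, since every test row finds an exact copy of itself in the training data. The honest evaluation should land near chance, which is the true predictive power of a model trained on noise.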

In “Leakage and the reproducibility crisis in machine-learning-based science,” published in the journal Patterns on August 4, 2023, Sayash Kapoor and Arvind Narayanan of Princeton surveyed the literature on the issue. They found that leakage had been documented in 17 scientific fields, collectively affecting 294 papers, and laid out a taxonomy of eight types of leakage, including inadequate separation of training and test data, the use of illegitimate features that encode the answer, temporal leakage in time-series problems, and non-independence between training and test samples. In several fields, correcting for leakage erased the claimed advantage of machine learning over simpler baselines.
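Two of the leakage types named above, non-independence between samples and temporal leakage, are usually addressed at the point where the data is split. The sketch below shows one possible approach under stated assumptions; it is not the procedure from the paper. The record layout (patient_id, year, measurement) and the helper names group_split and temporal_split are hypothetical.

```python
from collections import defaultdict

def group_split(rows, group_of, test_fraction=0.25):
    """Split so every group lands entirely in train or entirely in test."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)
    ordered = sorted(groups)  # deterministic here; in practice, shuffle with a seed
    n_test = max(1, round(len(ordered) * test_fraction))
    test_groups = set(ordered[-n_test:])
    train = [r for g in ordered if g not in test_groups for r in groups[g]]
    test = [r for g in ordered if g in test_groups for r in groups[g]]
    return train, test

def temporal_split(rows, time_of, cutoff):
    """Train strictly before the cutoff; test at or after it."""
    train = [r for r in rows if time_of(r) < cutoff]
    test = [r for r in rows if time_of(r) >= cutoff]
    return train, test

# Hypothetical records: (patient_id, year, measurement)
records = [
    ("p1", 2019, 0.4), ("p1", 2021, 0.5),
    ("p2", 2020, 0.9), ("p2", 2022, 0.8),
    ("p3", 2018, 0.1), ("p3", 2022, 0.2),
    ("p4", 2021, 0.7), ("p4", 2023, 0.6),
]

g_train, g_test = group_split(records, group_of=lambda r: r[0])
t_train, t_test = temporal_split(records, time_of=lambda r: r[1], cutoff=2022)

# No patient appears on both sides of the group split.
assert {r[0] for r in g_train}.isdisjoint({r[0] for r in g_test})
# Every training record predates every test record in the temporal split.
assert max(r[1] for r in t_train) < min(r[1] for r in t_test)
```

A purely random row-level split of the same records could place one patient's 2021 visit in training and their 2023 visit in testing, letting the model recognize the patient rather than predict the outcome. Which split is appropriate depends on how the model will be used: grouping matters when predictions are made for new individuals, and the time cutoff matters when predictions are made about the future.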

The authors argued that the fix is better reporting discipline, proposing standardized “model info sheets” so that reviewers and readers can check whether a result is trustworthy.

For any organization relying on a predictive model, the practical message is to be skeptical of headline accuracy numbers. A model that scores well in a paper or a vendor pitch may simply have been allowed to peek at the answers; the only meaningful test is performance on data it could not have seen, evaluated the way it will actually be used.
