Snorkel: Rapid Training Data Creation with Weak Supervision

“Snorkel: Rapid Training Data Creation with Weak Supervision” was submitted to arXiv on November 28, 2017 by Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré at Stanford. It tackles the most expensive part of supervised machine learning: getting labeled training data. Instead of paying people to hand-label examples one by one, Snorkel lets users write labeling functions, short pieces of code that express arbitrary heuristics whose accuracy is unknown and whose outputs may conflict.
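
To make the idea concrete, here is a minimal sketch of what labeling functions can look like, written in plain Python rather than with the Snorkel library itself. The spam-versus-ham task, the function names, and the label constants are all hypothetical and chosen only for illustration. Each function either votes for a label or abstains, and two functions can disagree on the same message.

```python
# Hypothetical labels for an illustrative spam/ham task (not from the paper).
ABSTAIN, HAM, SPAM = -1, 0, 1

GREETINGS = {"hi", "hello", "thanks"}


def lf_contains_link(text: str) -> int:
    """Heuristic: messages containing a URL are often spam."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Heuristic: prize or winner language suggests spam."""
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_greeting(text: str) -> int:
    """Heuristic: a short message that opens with a greeting is usually ham."""
    words = text.lower().split()
    if words and len(words) <= 5 and words[0] in GREETINGS:
        return HAM
    return ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_greeting]


def apply_lfs(texts):
    """Build a label matrix: one row per example, one column per function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]


texts = [
    "thanks for lunch",
    "You are a WINNER claim your prize at http://example.com",
    "hi check https://example.com",  # the link and greeting heuristics conflict here
    "Meeting moved to 3pm",          # no heuristic fires, so every function abstains
]
label_matrix = apply_lfs(texts)
```

The resulting matrix of votes and abstentions, not any hand-assigned label, is the raw material that the rest of the pipeline works from.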

The core trick is that Snorkel can combine and denoise these noisy labeling functions without ever seeing ground-truth labels, using an approach the authors call data programming. The system models how reliable each function is and how the functions correlate with one another, then produces probabilistic training labels good enough to train a final model. In a user study, subject-matter experts built models 2.8 times faster and improved predictive performance by an average of 45.5 percent compared with seven hours of hand labeling.
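
The denoising step can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not Snorkel's actual label model: it assumes binary labels, a uniform class prior, and conditionally independent labeling functions (so it ignores the correlations Snorkel models), and it estimates one accuracy per function with an EM-style loop in the spirit of a one-coin Dawid-Skene model. The function names, the initial accuracy of 0.7, and the toy vote matrix are assumptions made for illustration.

```python
import math

ABSTAIN = -1  # binary labels are 0 and 1; -1 means the function abstained


def predict_proba(label_matrix, accuracies):
    """Return P(label = 1) per row, weighting each vote by its log-odds accuracy."""
    probs = []
    for row in label_matrix:
        log_odds = 0.0
        for vote, acc in zip(row, accuracies):
            if vote == ABSTAIN:
                continue
            weight = math.log(acc / (1.0 - acc))
            log_odds += weight if vote == 1 else -weight
        probs.append(1.0 / (1.0 + math.exp(-log_odds)))
    return probs


def fit_accuracies(label_matrix, n_iters=20, init_acc=0.7):
    """Estimate per-function accuracies from agreement alone (no ground truth)."""
    n_lfs = len(label_matrix[0])
    accuracies = [init_acc] * n_lfs
    for _ in range(n_iters):
        probs = predict_proba(label_matrix, accuracies)
        for j in range(n_lfs):
            agree, total = 0.0, 0.0
            for row, p1 in zip(label_matrix, probs):
                if row[j] == ABSTAIN:
                    continue
                # Expected agreement between this vote and the current soft label.
                agree += p1 if row[j] == 1 else 1.0 - p1
                total += 1.0
            if total:
                # Clip so the log-odds weights stay finite.
                accuracies[j] = min(max(agree / total, 0.05), 0.95)
    return accuracies


# Toy vote matrix: functions 0 and 1 always agree; function 2 sometimes dissents.
votes = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, ABSTAIN],
    [0, ABSTAIN, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
accuracies = fit_accuracies(votes)
soft_labels = predict_proba(votes, accuracies)
```

On this toy matrix the dissenting function ends up with a lower estimated accuracy than the two that agree, and a row where every function abstains stays at a probability of 0.5. The soft labels, rather than hard 0/1 votes, are what a downstream model would be trained on.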

Snorkel reframed labeled data as something you program rather than purchase, and the weak-supervision ideas it popularized were later adopted at companies including Google, Apple, and IBM. For a business reader, the lesson is that the bottleneck in many AI projects is not the model but the labeled data, and that encoding domain expertise as rules can be far cheaper than armies of annotators.

Last verified June 7, 2026