“wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” submitted to arXiv on June 20, 2020 by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli at Facebook AI Research, showed that a speech recognizer could learn most of what it needs from audio that nobody had transcribed. The model masks parts of the speech signal in a latent space and is trained with a contrastive task to identify the correct quantized representation among distractors, learning useful structure before it ever sees a transcript.
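To make the contrastive task concrete, here is a minimal sketch in plain Python of the loss for a single masked time step. It assumes cosine similarity scaled by a temperature, with the true quantized target scored against a handful of distractors. The function names, the toy two-dimensional vectors, and the temperature value of 0.1 are illustrative assumptions, not the authors' code, and the sketch leaves out everything else in the training objective.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Contrastive loss for one masked time step (illustrative sketch).

    context:     the model's output vector at the masked position
    target:      the true quantized representation for that position
    distractors: quantized representations taken from other positions
    temperature: assumed scaling factor for the similarities
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability assigned to the true target (index 0).
    return log_denominator - logits[0]


# Toy example: the context vector points almost the same way as the target.
context = [1.0, 0.0]
target = [0.9, 0.1]
distractors = [[0.0, 1.0], [-1.0, 0.0]]

loss_correct = contrastive_loss(context, target, distractors)
# Swap in a distractor as the "target" to see the loss rise.
loss_wrong = contrastive_loss(context, distractors[0], [target, distractors[1]])
print(loss_correct < loss_wrong)
```

The loss is small when the context vector sits closest to the true target and grows when a distractor is a better match, which is the signal that pushes the model to learn structure from untranscribed audio.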
The headline result was about data efficiency. After pre-training on large amounts of unlabeled audio, the model could be fine-tuned on tiny labeled sets and still reach low word error rates (WER). The paper reports 4.8/8.2 WER on the Librispeech clean/other test sets using only ten minutes of labeled data on top of 53,000 hours of unlabeled pre-training, an outcome that would have seemed implausible a few years earlier.
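For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. A WER of 4.8 therefore means roughly 4.8 errors per 100 reference words. The sketch below is a minimal illustration of that definition in plain Python; the function name and example sentences are hypothetical, and it is not the evaluation code used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # reference word missing from the output
                    current_row[j - 1] + 1,  # extra word in the output
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{100 * wer:.1f}")
```

Published benchmarks typically apply text normalization before scoring, which this sketch omits.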
Why business readers should care: transcription quality used to depend on having thousands of hours of expensively labeled audio in a given language or domain. wav2vec 2.0 reframed the problem so that cheap, abundant raw audio does most of the work, opening practical speech recognition to low-resource languages and specialized vocabularies where labeled data is scarce.