WaveNet: A Generative Model for Raw Audio

“WaveNet: A Generative Model for Raw Audio,” posted to arXiv on September 12, 2016 by Aaron van den Oord and colleagues at DeepMind, introduced a neural network that generates sound one audio sample at a time. Instead of stitching together pre-recorded speech fragments or driving a hand-built vocoder, WaveNet models the raw waveform directly: it predicts each of the roughly 16,000 samples in a second of speech, conditioned on all the samples that came before it. This autoregressive, fully probabilistic design uses stacks of dilated causal convolutions, whose filters skip over past samples at progressively wider intervals, to reach far back in time without an explosion in computation.
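To make the idea of dilated causal convolutions concrete, here is a minimal sketch in plain Python. It is not DeepMind's implementation: the function names, the two-tap filter, and the doubling dilation schedule are assumptions chosen for illustration. It shows two things: each output depends only on the current and earlier samples, and doubling the dilation from layer to layer makes the reachable history grow quickly while the layer count stays small.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution with gaps of `dilation` between filter taps.

    weights[0] multiplies the current sample and weights[k] multiplies the
    sample k * dilation steps in the past. Positions before the start of
    the signal are treated as zero, so no output ever sees the future.
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                total += w * x[idx]
        out.append(total)
    return out


# Illustrative schedule (an assumption): dilations double from 1 to 512,
# and that block of ten layers is repeated three times.
dilations = [2 ** i for i in range(10)] * 3
span = receptive_field(dilations)
print(span)           # 3070 samples
print(span / 16000)   # roughly 0.19 seconds of audio at 16 kHz

# Causality check: changing the last input leaves earlier outputs unchanged.
a = causal_dilated_conv([1, 2, 3, 4], [1, 1], dilation=2)
b = causal_dilated_conv([1, 2, 3, 99], [1, 1], dilation=2)
print(a[:3] == b[:3])  # True
```

With thirty two-tap layers the sketch reaches back about 3,000 samples. A stack of undilated two-tap layers would need about 3,000 layers to cover the same span, which is the saving the dilation trick buys.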

The headline result was in text-to-speech. In listener tests, WaveNet closed more than half of the gap between the best existing parametric and concatenative systems and natural human speech, in both US English and Mandarin Chinese. A single model could capture the voices of many different speakers and switch between them when conditioned on a speaker identity, and when trained on piano recordings it produced novel, often realistic musical fragments. It could also be adapted as a discriminative model, with promising results on phoneme recognition.

The original WaveNet was far too slow to run in real products, since generating each sample required a full pass through the network. That limitation sparked a wave of follow-on work on faster vocoders, even as WaveNet itself became the synthesis backbone of systems like Tacotron 2. Google later deployed a distilled, parallelized version in its Assistant and Cloud Text-to-Speech.
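The cost of sample-by-sample generation is easy to see in a toy loop. The sketch below is only an illustration: `toy_model` is a hypothetical stand-in for the network, and the 256 output levels are an assumption made for the example. What matters is the shape of the loop: each new sample needs one call to the model, and that call cannot start until the previous sample has been drawn.

```python
import random


def generate(predict_next, n_samples, seed=0):
    """Draw samples one at a time; every draw costs one model call."""
    rng = random.Random(seed)
    samples = []
    calls = 0
    for _ in range(n_samples):
        probs = predict_next(samples)  # one full pass through the model
        calls += 1
        # Sample the next value from the predicted distribution.
        value = rng.choices(range(len(probs)), weights=probs)[0]
        samples.append(value)
    return samples, calls


def toy_model(history, levels=256):
    """Hypothetical stand-in for the network: favours values near the last one."""
    last = history[-1] if history else levels // 2
    return [1.0 / (1 + abs(v - last)) for v in range(levels)]


audio, passes = generate(toy_model, n_samples=160)  # 10 ms of audio at 16 kHz
print(passes)  # 160
```

At 16,000 samples per second, one second of audio needs 16,000 such passes in strict sequence, which is why later work focused on producing many samples in parallel.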

Why business readers should care: WaveNet marks the point at which machine-generated speech stopped sounding robotic. Many of today's voice assistants, audiobook narrators, and AI voice clones build on its core idea of modeling raw audio directly rather than assembling it from canned pieces.
