Parallel WaveNet: Fast High-Fidelity Speech Synthesis

“Parallel WaveNet: Fast High-Fidelity Speech Synthesis,” submitted to arXiv on November 28, 2017 by Aaron van den Oord, Yazhe Li, and colleagues at DeepMind, removed the main obstacle to deploying WaveNet: generation speed. The original WaveNet generated audio one sample at a time, which gave excellent quality but was far too slow for production. Parallel WaveNet introduces Probability Density Distillation, which trains a feed-forward student network to match the output distribution of a pretrained WaveNet teacher, so the student can generate all samples of an utterance in parallel.
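To make the distillation idea concrete, here is a deliberately tiny sketch in Python. It is not the paper's architecture. It assumes the teacher is a fixed one-dimensional Gaussian (standing in for a pretrained WaveNet's output distribution) and the student is a feed-forward transform of random noise, x = mu + sigma * z. The student is trained by minimizing the KL divergence from student to teacher, the quantity Probability Density Distillation is built around. The function names, learning rate, and step count are illustrative assumptions.

```python
import math
import random


def kl_gaussians(mu_s, sigma_s, mu_t, sigma_t):
    """Closed-form KL(student || teacher) for two 1-D Gaussians, in nats."""
    return (math.log(sigma_t / sigma_s)
            + (sigma_s ** 2 + (mu_s - mu_t) ** 2) / (2 * sigma_t ** 2)
            - 0.5)


def distill(mu_t, sigma_t, steps=500, lr=0.05):
    """Fit the toy student to the toy teacher by gradient descent on the KL.

    The student's scale is stored as log(sigma) so it stays positive.
    """
    mu_s, log_sigma_s = 0.0, 0.0
    for _ in range(steps):
        sigma_s = math.exp(log_sigma_s)
        # Analytic gradients of the closed-form KL above.
        grad_mu = (mu_s - mu_t) / sigma_t ** 2
        grad_log_sigma = sigma_s ** 2 / sigma_t ** 2 - 1.0
        mu_s -= lr * grad_mu
        log_sigma_s -= lr * grad_log_sigma
    return mu_s, math.exp(log_sigma_s)


def student_generate(mu_s, sigma_s, n, seed=0):
    """Draw n samples from the student.

    Each output depends only on its own noise value, never on earlier
    outputs, so all n could be computed at once on parallel hardware.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [mu_s + sigma_s * z for z in noise]


if __name__ == "__main__":
    teacher_mu, teacher_sigma = 2.0, 0.5  # hypothetical teacher parameters
    mu, sigma = distill(teacher_mu, teacher_sigma)
    print("student parameters:", mu, sigma)
    print("remaining KL:", kl_gaussians(mu, sigma, teacher_mu, teacher_sigma))
    print("samples:", student_generate(mu, sigma, 5))
```

The key design point the toy preserves is that the student never needs its own previous outputs to produce the next one, which is what makes parallel generation possible. The real student and teacher are deep networks over full waveforms, and the real training loss cannot be written in closed form like this.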

The distilled model generates speech more than 20 times faster than real time, and the paper reports no significant difference in quality from the original WaveNet. DeepMind also reported that the model was deployed in Google Assistant, where it serves synthetic voices in multiple languages.

Why business readers should care: Parallel WaveNet marks the point at which high-fidelity neural speech became fast enough for real products at scale. The distillation trick, which transfers the behavior of a slow but excellent model into a fast one, is a recurring pattern for making research-grade quality practical.
