WaveRNN: Efficient Neural Audio Synthesis

“Efficient Neural Audio Synthesis,” submitted to arXiv on February 23, 2018 by Nal Kalchbrenner, Erich Elsen, Aaron van den Oord, and colleagues at DeepMind and Google Brain, introduced WaveRNN, a deliberately compact recurrent network for generating raw audio. Its single-layer RNN with a dual softmax output layer matches the quality of WaveNet, and its compact form generates 24 kHz, 16-bit audio about four times faster than real time on a GPU.
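To make the dual softmax concrete: in the paper, each 16-bit sample is predicted as two 8-bit parts, a coarse (high) byte and a fine (low) byte, so the network needs two 256-way softmaxes rather than one 65,536-way softmax. The sketch below shows only that split and its inverse, not the network itself. The function names are hypothetical, and unsigned samples are assumed.

```python
from typing import Tuple

def split_sample(sample: int) -> Tuple[int, int]:
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit parts."""
    if not 0 <= sample <= 0xFFFF:
        raise ValueError("sample must be an unsigned 16-bit integer")
    coarse = sample >> 8   # high 8 bits, target of the first softmax
    fine = sample & 0xFF   # low 8 bits, target of the second softmax
    return coarse, fine

def join_sample(coarse: int, fine: int) -> int:
    """Recombine the two 8-bit parts into the original 16-bit sample."""
    if not (0 <= coarse <= 0xFF and 0 <= fine <= 0xFF):
        raise ValueError("coarse and fine must each fit in 8 bits")
    return (coarse << 8) | fine

# Output classes needed: two small softmaxes versus one large one.
DUAL_SOFTMAX_CLASSES = 2 * 2 ** 8   # 512 logits in total
SINGLE_SOFTMAX_CLASSES = 2 ** 16    # 65,536 logits
```

Splitting the output this way keeps the output layer small, which is part of what makes a single-layer model practical at 16-bit resolution.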

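The paper goes further on efficiency. Its weight-pruning experiments show that, for a fixed number of parameters, large sparse networks outperform small dense ones, and that this holds even beyond 96 percent sparsity. The resulting small weight count enables real-time synthesis on a mobile CPU. A subscale generation scheme folds one long sequence into a batch of shorter ones so that several samples can be generated at once; the paper's Subscale WaveRNN produces 16 samples per step without losing quality.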
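The two ideas in this paragraph can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the pruning function uses a one-shot magnitude criterion (the paper prunes gradually during training), the folding function assumes the interleaved layout in which sub-sequence i holds every B-th sample starting at offset i, and the dependencies between sub-sequences that make parallel generation valid are omitted. All function names are hypothetical.

```python
from typing import List

def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero the smallest-magnitude weights so that about `sparsity` of them are zero."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_zero = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_zero]:
        pruned[i] = 0.0
    return pruned

def fold_subscale(samples: List[int], batch: int) -> List[List[int]]:
    """Fold one long sequence into `batch` shorter, interleaved sub-sequences."""
    if batch <= 0 or len(samples) % batch != 0:
        raise ValueError("sequence length must be a positive multiple of batch")
    # Sub-sequence i takes samples i, i + batch, i + 2 * batch, ...
    return [samples[i::batch] for i in range(batch)]

def unfold_subscale(folded: List[List[int]]) -> List[int]:
    """Interleave the sub-sequences back into the original sample order."""
    return [s for group in zip(*folded) for s in group]
```

For example, folding eight samples with `batch=4` yields four sub-sequences of two samples each, and each sub-sequence is a version of the signal at one quarter of the original sample rate.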

Why business readers should care: WaveRNN helped move high-quality neural speech synthesis onto everyday devices, not just data-center GPUs. On-device generation matters for latency, privacy, and cost, and the sparsity findings informed a broader push toward smaller, cheaper models.
