SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

“SampleRNN: An Unconditional End-to-End Neural Audio Generation Model,” submitted to arXiv on December 22, 2016 by Soroush Mehri, Kundan Kumar, and colleagues including Aaron Courville and Yoshua Bengio at the Montreal lab MILA, offered a recurrent-network alternative to WaveNet for generating raw audio sample by sample. It appeared at ICLR 2017, just months after WaveNet, and tackled the same hard problem: audio has tens of thousands of samples per second, so capturing both fine detail and long-range structure in one model is difficult.

SampleRNN’s answer was a hierarchy. Several modules run at different time scales: slower recurrent networks at the top capture coarse, longer-term structure, while faster modules near the bottom, including simple autoregressive multilayer perceptrons, fill in the sample-level detail. Each level conditions the one below it. The model generated audio with no external input (unconditional generation), and in the authors' human evaluations listeners preferred its output to that of competing methods across speech and music datasets.
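To make the hierarchy concrete, here is a minimal two-level sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, and the names and sizes (`FRAME_SIZE`, `HIDDEN`, `LEVELS`, `generate`) are hypothetical choices made for this example. A slow recurrent level updates once per frame, and a fast multilayer perceptron then predicts each sample from the previous samples plus the conditioning vector handed down from the slow level.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
FRAME_SIZE = 4   # samples produced per step of the slow level
HIDDEN = 16      # width of the hidden layers
LEVELS = 256     # number of quantized amplitude levels per sample


def init_params(rng):
    """Random, untrained weights; a real model would learn these from audio."""
    s = 0.1
    return {
        "W_in": rng.normal(0, s, (FRAME_SIZE, HIDDEN)),
        "W_h": rng.normal(0, s, (HIDDEN, HIDDEN)),
        "W_cond": rng.normal(0, s, (HIDDEN, FRAME_SIZE * HIDDEN)),
        "W_prev": rng.normal(0, s, (FRAME_SIZE, HIDDEN)),
        "W_out": rng.normal(0, s, (HIDDEN, LEVELS)),
    }


def scale(samples):
    """Map integer sample values in [0, LEVELS) to roughly [-1, 1)."""
    return np.asarray(samples, dtype=np.float64) / (LEVELS / 2) - 1.0


def generate(num_frames, seed=0):
    """Generate num_frames * FRAME_SIZE samples with no external input."""
    rng = np.random.default_rng(seed)
    p = init_params(rng)
    audio = [LEVELS // 2] * FRAME_SIZE  # mid-level "silence" as warm-up context
    h = np.zeros(HIDDEN)                # recurrent state of the slow level

    for _ in range(num_frames):
        # Slow level: one recurrent update per frame of past samples.
        frame = scale(audio[-FRAME_SIZE:])
        h = np.tanh(frame @ p["W_in"] + h @ p["W_h"])
        # One conditioning vector for each sample position in the next frame.
        cond = (h @ p["W_cond"]).reshape(FRAME_SIZE, HIDDEN)

        # Fast level: an autoregressive MLP, run once per sample.
        for t in range(FRAME_SIZE):
            prev = scale(audio[-FRAME_SIZE:])
            hidden = np.maximum(0.0, prev @ p["W_prev"] + cond[t])
            logits = hidden @ p["W_out"]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            # Sample the next quantized value and feed it back as context.
            audio.append(int(rng.choice(LEVELS, p=probs)))

    return np.array(audio[FRAME_SIZE:])


samples = generate(num_frames=5)
print(samples.shape)  # (20,)
```

With untrained weights the output is noise, but the control flow shows the key idea: the expensive recurrent step runs only once per frame, while the cheap per-sample network handles the fine detail.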

Why business readers should care: SampleRNN and WaveNet together established in 2016 that neural networks could generate convincing raw audio from scratch, opening the line of research that leads to modern AI speech and music. SampleRNN is a reminder that the breakthrough was an approach, not a single architecture: recurrent and convolutional designs arrived at it almost simultaneously.
