SoundStream: An End-to-End Neural Audio Codec

“SoundStream: An End-to-End Neural Audio Codec,” submitted to arXiv on July 7, 2021 by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi of Google, presents a neural codec that compresses speech, music, and general audio at bitrates from 3 to 18 kbps. It pairs a fully convolutional encoder-decoder with a residual vector quantizer, trains them jointly end to end, and applies structured dropout to the quantizer layers so that a single model can serve a range of bitrates without retraining.
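To make the quantizer idea concrete, here is a minimal sketch of residual vector quantization with a variable number of active stages. It is an illustration under stated assumptions, not the authors' implementation: the function names, the random toy codebooks, and the numbers in the bitrate helper's usage are all hypothetical, and a real codec would learn its codebooks during training.

```python
import math
import random


def random_codebooks(num_stages, codebook_size, dim, rng):
    """Toy stand-in for learned codebooks (assumption: real ones are trained)."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(codebook_size)]
        for _ in range(num_stages)
    ]


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks, num_stages=None):
    """Each stage quantizes the residual left over by the previous stage."""
    stages = codebooks if num_stages is None else codebooks[:num_stages]
    residual = list(vec)
    indices = []
    for codebook in stages:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries from as many stages as there are indices."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def sample_num_stages(total_stages, rng):
    """Structured dropout, sketched: pick how many stages stay active per example."""
    return rng.randint(1, total_stages)


def bitrate_bps(num_stages, codebook_size, frames_per_second):
    """Bits per frame (stages x log2 of codebook size) times frames per second."""
    return num_stages * math.log2(codebook_size) * frames_per_second
```

Because each stage depends only on the stages before it, encoding with the first k stages yields exactly the first k indices of the full encoding. That prefix property is what lets one model trade quality for bitrate by dropping later stages. As an illustrative calculation with assumed values, `bitrate_bps(8, 1024, 75)` gives 6000 bits per second.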

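SoundStream runs in real time on a smartphone CPU and can perform compression and enhancement, such as noise suppression, in a single pass. In the paper's subjective listening tests, SoundStream at 3 kbps outperformed Opus at 12 kbps and approached EVS at 9.6 kbps, making it among the first learned codecs to clearly beat traditional ones at low bitrates.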

Why business readers should care: SoundStream pioneered the discrete-token representation of audio that later powered Google’s AudioLM and influenced Meta’s EnCodec. Its quantization scheme is the template that turned audio into something generative models can read and write, while also enabling better low-bandwidth voice calls.
