Music Source Separation in the Waveform Domain (Demucs)

“Music Source Separation in the Waveform Domain,” submitted to arXiv on November 27, 2019 by Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach at Facebook AI Research, introduced Demucs, a model that pulls a finished song apart into four component tracks: vocals, drums, bass, and everything else. The novelty was that it operated directly on the audio waveform rather than on a spectrogram, the time-frequency representation most separation systems used at the time.

Most prior systems estimated a mask over the magnitude spectrogram. That approach discards phase information, typically reusing the phase of the original mixture, which caps achievable quality. Demucs instead used an encoder-decoder architecture with a U-Net structure and a bidirectional LSTM in the middle, mapping waveform to waveform end to end. It reached an average signal-to-distortion ratio (SDR) of 6.3 dB on the standard MusDB benchmark and was judged more natural by human listeners than the leading spectrogram methods, though it sometimes leaked sound between sources such as vocals and other instruments.
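To make the SDR figure concrete, here is a minimal sketch of a simplified signal-to-distortion ratio: the energy of the true source divided by the energy of the estimation error, expressed in decibels. This is an illustration only. Published MusDB scores come from a more elaborate evaluation procedure than this plain ratio, and the function name `sdr_db` and the toy sine-wave signal are assumptions made for the example, not part of the paper.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(signal energy / error energy).

    Illustrative only; benchmark evaluations use a more involved procedure.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0.0:
        return -math.inf  # silent reference, non-silent estimate
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy signal (assumption): one second of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
reference = [math.sin(2 * math.pi * 440 * n / sample_rate)
             for n in range(sample_rate)]

# An estimate that is 10% too quiet: the error is 0.1 * reference,
# so the energy ratio is 1 / 0.01 = 100, i.e. 20 dB.
estimate = [0.9 * r for r in reference]

print(round(sdr_db(reference, estimate), 1))  # 20.0
```

Under this simplified definition, a score of 6.3 dB would mean the source carries roughly 4.3 times as much energy as the residual error, since 10^(6.3/10) is about 4.27. Higher is better, and each additional 10 dB corresponds to a tenfold reduction in error energy.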

Why business readers should care: stem separation is the quiet workhorse behind remixing, karaoke, sample clearance, transcription, and music-education tools. Demucs and its successors made separation quality good enough to build products on, and they underpin features now common in consumer music apps.
