OpenAI's Jukebox generates raw-audio music with singing

On April 30, 2020, OpenAI released Jukebox, described in the paper “Jukebox: A Generative Model for Music” by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Unlike earlier systems that generated symbolic note sequences, Jukebox produced music directly as raw audio, including rough but recognizable singing, and could be steered by genre, artist, and lyrics.

The model used a multi-scale VQ-VAE to compress audio into discrete codes, then trained autoregressive Transformers to model those codes. This is the same next-token prediction idea behind language models, applied to compressed sound. The paper reports the system could produce “high-fidelity and diverse songs with coherence up to multiple minutes.” OpenAI trained it on a curated dataset of 1.2 million songs paired with lyrics and metadata, and released model weights, code, and thousands of generated samples.
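To make the "discrete codes plus next-token prediction" idea concrete, here is a minimal sketch in Python with NumPy. It is not Jukebox's code: the function names (`quantize`, `dequantize`, `next_token_pairs`), the four-entry codebook, and the two-dimensional latents are all hypothetical toy values chosen for illustration. In the real system the encoder and codebooks are learned neural components and far larger, and a Transformer, not shown here, does the prediction.

```python
import numpy as np


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    # Pairwise squared Euclidean distances, shape (num_latents, num_codes).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)


def dequantize(codes, codebook):
    """Look the codes back up to get the (lossy) reconstructed latents."""
    return codebook[codes]


def next_token_pairs(codes):
    """Shift the code sequence by one step to form (input, target) pairs."""
    return codes[:-1], codes[1:]


# Toy codebook: 4 codes, each a 2-dimensional vector (assumed values).
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Toy "encoder output": one latent vector per short chunk of audio (assumed values).
latents = np.array([[0.1, 0.9], [0.9, 0.2], [0.8, 1.1], [0.1, 0.9]])

codes = quantize(latents, codebook)          # discrete tokens standing in for audio
reconstructed = dequantize(codes, codebook)  # what a decoder would start from
inputs, targets = next_token_pairs(codes)    # training pairs for an autoregressive model

print(codes.tolist())    # [1, 2, 3, 1]
print(inputs.tolist())   # [1, 2, 3]
print(targets.tolist())  # [2, 3, 1]
```

Once the audio is reduced to a sequence of integers like `codes`, modelling music becomes the same kind of problem as modelling text: predict each token from the ones before it, then decode the predicted tokens back into sound.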

Jukebox followed OpenAI’s 2019 MuseNet, which generated multi-instrument MIDI compositions using sparse Transformers. Jukebox went further by working in the audio domain itself, where capturing a singing voice and instrumental texture is far harder than arranging notes: a four-minute song at CD-quality sampling rates runs to more than 10 million samples. Its output was striking but flawed, with warbly vocals and muddy fidelity, and it was slow to sample. OpenAI reported roughly nine hours to render one minute of audio, which kept it a research artifact rather than a product. It nonetheless pointed directly at the consumer music generators that arrived a few years later.

Why business readers should care: Jukebox was the proof of concept for end-to-end AI music generation, the technical lineage that products like Suno and Udio would later turn into a market.
