“MusicLM: Generating Music From Text,” submitted to arXiv on January 26, 2023 by Andrea Agostinelli, Timo Denk, Zalan Borsos, and colleagues at Google, introduced a model that generates high-fidelity music from a text description such as “a calming violin melody backed by a distorted guitar riff.” It built on the AudioLM framing of audio as a hierarchy of tokens, casting text-to-music as a sequence-modeling task and producing music that stayed consistent over several minutes, a much harder feat than the short clips earlier systems managed.
MusicLM could be conditioned on text alone or on text plus a hummed or whistled melody, letting a user sketch a tune and have the model render it in the style the text described. Alongside the model, the team released MusicCaps, an evaluation dataset of about 5,500 music-text pairs with rich descriptions written by professional musicians, to give the field a shared yardstick.
Google did not release the MusicLM model itself at first, citing the risk of copying copyrighted material from training data, an early signal of the legal tension that would soon engulf AI music. A limited public version appeared later in Google’s AI Test Kitchen.
Why business readers should care: MusicLM proved that text-to-music could produce long, coherent, controllable pieces, setting the technical bar that consumer products like Suno and Udio would race to clear, and surfacing the training-data and copyright questions that now dominate the AI music business.