AudioLM: a Language Modeling Approach to Audio Generation

“AudioLM: a Language Modeling Approach to Audio Generation,” submitted to arXiv on September 7, 2022 by Zalán Borsos, Neil Zeghidour, and colleagues at Google, applied the machinery of large language models to raw sound. The idea is to convert audio into a sequence of discrete tokens and then treat generation as next-token prediction: the model predicts the next token the way a text language model predicts the next word, except that here each token stands for a short fragment of sound.
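To make the tokens-then-prediction idea concrete, here is a minimal toy sketch in Python. It is not AudioLM's method: the uniform quantizer and the bigram count table are deliberately simple stand-ins for the learned tokenizers and Transformer models the paper uses, and the names `tokenize`, `fit_bigram`, and `continue_tokens` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def tokenize(samples, levels=8):
    """Map each sample in [-1, 1] to an integer token in [0, levels - 1]."""
    tokens = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        index = int((clipped + 1.0) / 2.0 * levels)
        tokens.append(min(levels - 1, index))
    return tokens


def fit_bigram(tokens):
    """Count how often each token follows each other token (toy 'language model')."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_tokens(counts, prompt, steps):
    """Greedily extend `prompt` by up to `steps` tokens."""
    out = list(prompt)
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break  # this token was never seen with a successor
        # Pick the most frequent successor; ties go to the smaller token id.
        best = min(followers.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        out.append(best)
    return out


# Assumed toy data: a repeating four-token pattern standing in for tokenized audio.
training_tokens = [0, 1, 2, 3] * 4
model = fit_bigram(training_tokens)
print(tokenize([-1.0, 0.0, 1.0]))
print(continue_tokens(model, [0, 1], 4))
```

The continuation step only ever sees integers, which is the point: once sound has been turned into tokens, the prediction loop looks the same as it would for text.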

The hard part is that no single kind of token serves both goals, long-term coherence and high-fidelity sound. AudioLM used a hybrid scheme: semantic tokens, drawn from a self-supervised model pretrained on audio, capture long-range structure such as phonetics and melody, while acoustic tokens from a neural audio codec capture the fine detail needed for high-quality reconstruction. Given a prompt only a few seconds long, the model produced natural continuations that kept the original speaker’s identity and prosody, and when trained on piano music it continued musical passages coherently without ever being given a score or symbolic notation.
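The division of labor between the two token types can be illustrated with a toy two-level quantizer. This is a loose analogy under stated assumptions, not the paper's tokenizers: AudioLM's semantic tokens come from a self-supervised model rather than from coarse amplitude averages, and its acoustic tokens come from a neural codec rather than from a fixed residual grid. The functions `encode`, `decode`, and `max_error`, and the frame and step sizes, are hypothetical.

```python
def quantize(value, step):
    """Return the integer index of the multiple of `step` nearest to `value`."""
    return round(value / step)


def encode(samples, frame=4, coarse_step=0.5, fine_step=0.05):
    """Produce one coarse token per frame and one fine (residual) token per sample."""
    coarse, fine = [], []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        mean = sum(chunk) / len(chunk)
        c = quantize(mean, coarse_step)
        coarse.append(c)
        # Fine tokens describe what the coarse token leaves out.
        fine.extend(quantize(s - c * coarse_step, fine_step) for s in chunk)
    return coarse, fine


def decode(coarse, fine, frame=4, coarse_step=0.5, fine_step=0.05, use_fine=True):
    """Rebuild samples from the coarse tokens, optionally adding the fine detail."""
    out = []
    for i, f in enumerate(fine):
        value = coarse[i // frame] * coarse_step
        if use_fine:
            value += f * fine_step
        out.append(value)
    return out


def max_error(a, b):
    """Largest absolute difference between two equal-length sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


# Assumed toy signal: two frames at clearly different levels.
samples = [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.0]
coarse, fine = encode(samples)
print(len(coarse), len(fine))
print(max_error(samples, decode(coarse, fine, use_fine=False)))
print(max_error(samples, decode(coarse, fine)))
```

In this sketch the coarse stream is short and tracks the overall shape, while the longer fine stream is what brings the reconstruction error down; that is the same trade-off, in miniature, that motivates pairing semantic tokens with acoustic ones.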

Why business readers should care: AudioLM showed that the next-token-prediction recipe behind text models transfers to audio, unifying speech and music generation under one framework. It is the direct technical ancestor of Google’s text-to-music systems and of the broader move to treat every modality as a sequence of tokens.
