Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)

“Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers,” submitted to arXiv on January 5, 2023 by Chengyi Wang and colleagues at Microsoft, introduced VALL-E, a text-to-speech system that could mimic an unseen speaker from just a 3-second recording. It reframed speech synthesis from a continuous-signal regression problem into a language-modeling one: the model predicts discrete audio tokens from an off-the-shelf neural codec, the same token-prediction recipe used for text.

Trained on 60,000 hours of English speech, VALL-E performed zero-shot voice cloning: it needed no fine-tuning on a new voice. Given a 3-second clip as an acoustic prompt plus the text to be spoken, it generated speech in that person’s voice. It preserved not only timbre but also the speaker’s emotional tone and the acoustic character of the recording environment, such as a phone call or a room’s reverberation.
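
To make the token-prediction framing concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. The codebook size, the frame rate, the function names, and the stand-in "model" are all assumptions chosen for illustration, and the sketch flattens the audio into a single token stream. It shows only the shape of the loop: the text tokens and the prompt's audio tokens form the context, and new audio tokens are sampled one at a time.

```python
import random

# Illustrative assumptions, not values taken from this article.
CODEBOOK_SIZE = 1024   # assumed number of discrete codes the codec can emit
FRAME_RATE_HZ = 75     # assumed codec frames per second of audio


def toy_next_token_weights(text_tokens, audio_tokens):
    """Hypothetical stand-in for a trained model.

    A real model would compute these weights from the text and the audio
    context. This toy starts from uniform weights and slightly favours
    repeating the most recent audio token, so it depends on context in
    the most trivial way possible.
    """
    weights = [1.0] * CODEBOOK_SIZE
    if audio_tokens:
        weights[audio_tokens[-1]] += 1.0
    return weights


def generate(text_tokens, prompt_audio_tokens, max_new_frames, rng):
    """Sample new audio tokens one at a time, conditioned on the prompt."""
    sequence = list(prompt_audio_tokens)
    for _ in range(max_new_frames):
        weights = toy_next_token_weights(text_tokens, sequence)
        token = rng.choices(range(CODEBOOK_SIZE), weights=weights, k=1)[0]
        sequence.append(token)
    # Return only the newly generated tokens; a codec decoder (not shown)
    # would turn them back into a waveform.
    return sequence[len(prompt_audio_tokens):]


if __name__ == "__main__":
    rng = random.Random(0)
    # A 3-second prompt becomes 3 * FRAME_RATE_HZ tokens in this toy setup.
    prompt = [rng.randrange(CODEBOOK_SIZE) for _ in range(3 * FRAME_RATE_HZ)]
    text = [12, 7, 301]  # hypothetical phoneme ids for the target sentence
    new_tokens = generate(text, prompt, max_new_frames=10, rng=rng)
    print(len(prompt), len(new_tokens))
```

Under these assumed numbers, the 3-second prompt amounts to only a few hundred tokens of context, which gives a sense of how little speaker data the model has to work from.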

Microsoft declined to release VALL-E publicly, citing the obvious misuse risk: a 3-second sample is short enough to scrape from almost anyone, making convincing voice impersonation trivial. The work crystallized the deepfake-audio threat that voice-security and content-provenance efforts now grapple with.

Why business readers should care: VALL-E collapsed the data needed to clone a voice from minutes to seconds, turning realistic audio impersonation into a commodity capability. It is a central reference point for fraud risk, consent and likeness rights, and the case for audio watermarking and provenance.
