VALL-E 2: Human Parity Zero-Shot Text to Speech

“VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers,” submitted to arXiv on June 8, 2024 by Sanyuan Chen, Shujie Liu, Long Zhou, and colleagues at Microsoft, extends the original VALL-E approach, which treats speech synthesis as language modeling over audio codec tokens. It adds two techniques: Repetition Aware Sampling, which stabilizes decoding, and Grouped Code Modeling, which shortens the token sequences the model has to process. Together they improve robustness and quality.
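For readers who want a concrete picture of the two techniques, the following is a minimal Python sketch, not the authors' implementation. It assumes Repetition Aware Sampling works by drawing a token with nucleus (top-p) sampling and falling back to sampling from the full distribution when that token already dominates the recent decoding history, and that Grouped Code Modeling amounts to packing consecutive codec tokens into fixed-size groups so that each model step covers one group. The function names and the default values for `top_p`, `window`, and `threshold` are hypothetical and chosen only for illustration.

```python
import random


def nucleus_sample(probs, top_p, rng=random):
    """Sample a token index from the smallest set of most likely tokens
    whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            threshold=0.1, rng=random):
    """Illustrative repetition-aware sampling step (assumed behavior).

    Draw a token with nucleus sampling. If that token makes up more than
    `threshold` of the last `window` tokens, resample from the full
    distribution instead, which helps the decoder escape repetition loops.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Pack a flat codec-token sequence into consecutive groups.

    With one group per model step, the sequence the model sees is
    group_size times shorter. Assumes the length divides evenly.
    """
    if len(codes) % group_size != 0:
        raise ValueError("sequence length must be a multiple of group_size")
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]
```

Under these assumptions, a 6-token sequence grouped in pairs becomes 3 model steps, which is the sense in which grouping shortens sequences. The fallback in `repetition_aware_sample` only triggers when the history is already repetitive, so ordinary decoding steps behave like plain nucleus sampling.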

The paper claims a milestone: the first system to reach human parity in zero-shot text-to-speech. In practice, that means it can imitate a target speaker from a short reference recording and produce speech that rates on par with human recordings for naturalness and speaker similarity on the paper's evaluation sets. Microsoft positioned it as a research demonstration rather than a released product, citing misuse risks.

Why business readers should care: VALL-E 2 shows how close voice cloning has come to human quality, using only seconds of reference audio. That capability is powerful for accessibility and localization, but it also raises serious authentication, fraud, and consent concerns that organizations will need to manage.

