NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

“NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality” was submitted to arXiv on May 9, 2022, by Xu Tan, Jiawei Chen, Haohe Liu, and colleagues at Microsoft. The paper set out to define human-level quality in text-to-speech and then to reach it. The system is built on a variational autoencoder and adds phoneme pre-training, a differentiable duration model, and closer matching of the prior and posterior distributions, all aimed at closing the gap between training and inference.

On the LJSpeech dataset, NaturalSpeech reported a comparative mean opinion score (CMOS) of -0.01 against human recordings, a difference that was not statistically significant (the paper reports a Wilcoxon signed-rank test with p well above 0.05). The claim of human-level parity was therefore backed by formal listening tests. The paper framed “human-level quality” as a measurable, defensible bar rather than a marketing phrase.
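To make the metric concrete, here is a minimal sketch of how a CMOS and a paired significance check can be computed. It rests on several assumptions: listeners compare the synthetic and human versions of the same sentence and give a score on the commonly used -3 to +3 scale (negative meaning the synthetic version sounded worse), the ratings below are invented for illustration and are not data from the paper, and the exact sign-flip test is a simple stand-in for the Wilcoxon signed-rank test, not the paper's procedure. The function names `cmos` and `sign_flip_p_value` are hypothetical.

```python
from itertools import product
from statistics import mean


def cmos(scores):
    """Comparative MOS: the mean of paired comparison scores.

    Each score is one listener judgment of synthetic vs. human audio
    for the same sentence, assumed to lie on a -3..+3 scale.
    """
    return mean(scores)


def sign_flip_p_value(scores):
    """Exact two-sided sign-flip test on the sum of paired scores.

    Under the null hypothesis of no preference, each score is equally
    likely to carry either sign, so we enumerate all 2**n sign patterns
    and count those at least as extreme as the observed sum.
    Note: this is a stand-in for the Wilcoxon signed-rank test and is
    only practical for small n.
    """
    observed = abs(sum(scores))
    n = len(scores)
    hits = 0
    for signs in product((1, -1), repeat=n):
        flipped = sum(s * x for s, x in zip(signs, scores))
        if abs(flipped) >= observed:
            hits += 1
    return hits / 2 ** n


# Hypothetical ratings, invented for illustration only (not from the paper).
ratings = [0, 1, -1, 0, 0, 1, -1, 0, -1, 1, 0, -1]

print(f"CMOS = {cmos(ratings):+.2f}")
print(f"p-value = {sign_flip_p_value(ratings):.3f}")
```

A CMOS near zero combined with a large p-value is the pattern the paper relies on: listeners show no consistent preference for the human recording over the synthetic one.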

Why business readers should care: NaturalSpeech marked the point where synthetic narration could be, by careful measurement, indistinguishable from a real voice on read speech, at least for a single speaker on a standard benchmark. That has direct implications for audiobooks, accessibility, and media, and it sharpens the questions around disclosure and consent for synthetic voices.

Last verified June 7, 2026