“FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” submitted to arXiv on June 8, 2020 by Yi Ren, Xu Tan, and colleagues at Microsoft and Zhejiang University, refined the original FastSpeech. The first version depended on a complex teacher-student distillation pipeline to obtain phoneme durations. FastSpeech 2 instead trains directly on ground-truth targets and conditions generation on extra variation information (pitch, energy, and more accurate durations), which makes the one-to-many mapping from text to speech easier to learn.
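To make the conditioning idea concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes phoneme-level hidden vectors are first stretched to frame level using per-phoneme durations, and that a per-frame value such as pitch or energy is then quantized into bins whose embeddings are added to each frame. The function names (`length_regulate`, `add_variance`), the bin edges, and the embedding tables are hypothetical and chosen only for illustration. During training the durations and values would come from the ground-truth recordings; at inference they would come from learned predictors, which are not shown.

```python
from bisect import bisect_right


def length_regulate(phoneme_states, durations):
    """Expand phoneme-level vectors to frame level.

    Each phoneme's vector is repeated durations[i] times, one copy per frame.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        frames.extend(list(state) for _ in range(n_frames))
    return frames


def add_variance(frames, values, bin_edges, embedding_table):
    """Condition frames on one value per frame (e.g. pitch or energy).

    The value is quantized into a bin, and that bin's embedding is added
    element-wise to the frame. embedding_table needs len(bin_edges) + 1 rows.
    """
    if len(frames) != len(values):
        raise ValueError("need exactly one value per frame")
    conditioned = []
    for frame, value in zip(frames, values):
        bin_index = bisect_right(bin_edges, value)  # 0 .. len(bin_edges)
        embedding = embedding_table[bin_index]
        conditioned.append([h + e for h, e in zip(frame, embedding)])
    return conditioned


# Toy example (all numbers are made up): 2 phonemes, 2-dimensional hidden
# states, 3 output frames, and 3 pitch bins split at 100 Hz and 200 Hz.
states = [[1.0, 0.0], [0.0, 1.0]]
frames = length_regulate(states, durations=[2, 1])
pitch_table = [[0.5, 0.5], [0.25, 0.25], [0.125, 0.125]]
frames = add_variance(frames, [90.0, 150.0, 250.0], [100.0, 200.0], pitch_table)
```

The same `add_variance` call could be repeated with an energy table. The point of the design is that the decoder receives explicit hints about how a sentence should sound instead of having to guess them from text alone.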
The paper also introduces FastSpeech 2s, a variant that generates the speech waveform directly from text in parallel, which the authors describe as the first attempt to do so with fully end-to-end inference. The reported result is a simpler training pipeline, roughly 3x faster training, and higher voice quality than the original FastSpeech.
Why business readers should care: FastSpeech 2 is a workhorse design behind many fast, natural-sounding voices. Conditioning on prosodic features such as pitch and energy gives developers more control over how expressive synthetic speech sounds, which matters for assistants, audiobooks, and accessibility tools.