FastSpeech: Fast, Robust and Controllable Text to Speech

“FastSpeech: Fast, Robust and Controllable Text to Speech,” submitted to arXiv on May 22, 2019 by Yi Ren, Xu Tan, and colleagues from Zhejiang University and Microsoft, tackled a practical weakness of high-quality text-to-speech: speed. Autoregressive models like Tacotron generate a spectrogram frame by frame, which is slow and occasionally produces skipped or repeated words. FastSpeech uses a feed-forward Transformer that generates the whole mel-spectrogram in parallel.

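To do this, it extracts phoneme durations from the attention alignments of an autoregressive teacher model and trains a duration predictor on them. A length regulator then repeats each phoneme's hidden representation to match its predicted duration, so the expanded sequence has the same length as the target mel-spectrogram. The paper reports mel-spectrogram generation about 270 times faster and end-to-end synthesis roughly 38 times faster than the autoregressive baseline, with comparable quality and far fewer alignment errors. Because durations are explicit, scaling them lets users adjust speaking speed.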
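The expansion step, which the paper calls a length regulator, can be illustrated with a small sketch. This is not the authors' implementation: the function name `length_regulate`, the use of plain Python lists in place of tensors, and the rounding rule are assumptions made for illustration. Each phoneme's hidden state is repeated for its predicted number of frames, and a speed factor `alpha` scales those durations before rounding.

```python
def length_regulate(hidden, durations, alpha=1.0):
    """Repeat each phoneme's hidden state to cover its predicted frame count.

    hidden    -- one entry per phoneme (stand-ins for hidden vectors)
    durations -- predicted number of mel-spectrogram frames per phoneme
    alpha     -- hypothetical speed factor: above 1.0 lengthens every
                 phoneme (slower speech), below 1.0 shortens it (faster)
    """
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, duration in zip(hidden, durations):
        # Python's round() sends exact .5 ties to the nearest even integer;
        # a real system may choose a different rounding rule.
        n_frames = max(0, round(duration * alpha))
        frames.extend([state] * n_frames)
    return frames


phoneme_states = ["h1", "h2", "h3", "h4"]
predicted_durations = [2, 2, 3, 1]

normal = length_regulate(phoneme_states, predicted_durations)
slower = length_regulate(phoneme_states, predicted_durations, alpha=1.3)

print(normal)       # 8 frames: h1 h1 h2 h2 h3 h3 h3 h4
print(len(slower))  # 11 frames, since durations round to 3, 3, 4, 1
```

Because every output frame's position is fixed by the durations before decoding starts, the decoder can produce all frames in parallel, and no phoneme can be skipped or repeated by a drifting attention alignment.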

Why business readers should care: FastSpeech made high-quality synthetic speech fast and reliable enough for interactive use, from voice assistants to real-time narration. Its parallel, controllable design influenced the wave of efficient neural TTS systems that followed.
