“Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” submitted to arXiv on December 16, 2017 by Jonathan Shen, Ruoming Pang, and colleagues at Google, described Tacotron 2, an end-to-end text-to-speech system. It works in two stages. First, a sequence-to-sequence network with attention turns a string of characters into a mel spectrogram, a compact picture of how the sound’s frequencies change over time. Second, a modified WaveNet turns that spectrogram into an audio waveform.
The key insight was that the mel spectrogram is a good middle ground. Predicting it first, rather than predicting raw audio straight from text, let the team strip WaveNet down to a much simpler vocoder while keeping the quality. The result reached a mean opinion score of 4.53 on a 5-point naturalness scale, which the authors noted was close to the 4.58 they measured for professionally recorded human speech.
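To make the intermediate representation concrete, here is a minimal sketch of computing a log-mel spectrogram from a waveform with NumPy. The 50 ms Hann window, 12.5 ms hop, 80 mel channels, 125 Hz to 7.6 kHz range, and 0.01 clipping floor follow the settings reported in the Tacotron 2 paper. The 24 kHz sample rate, the FFT size, the HTK-style mel formula, and all function names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=24000, n_fft=2048, win_ms=50.0, hop_ms=12.5,
                        n_mels=80, fmin=125.0, fmax=7600.0, floor=0.01):
    win = int(sr * win_ms / 1000)   # samples per analysis window
    hop = int(sr * hop_ms / 1000)   # samples between successive frames
    n_frames = 1 + (len(wave) - win) // hop
    window = np.hanning(win)
    frames = np.stack([wave[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame, zero-padded to n_fft points.
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    mel = mag @ mel_filterbank(sr, n_fft, n_mels, fmin, fmax).T
    # Clip to a floor before the log to stabilize quiet frames.
    return np.log(np.maximum(mel, floor))

# One second of a 440 Hz tone as a stand-in for speech.
t = np.arange(24000) / 24000.0
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)
```

Under these assumptions, one second of audio (24,000 samples) becomes 77 frames of 80 values each, which illustrates why the spectrogram is a far shorter sequence for the text-to-spectrogram network to predict than raw audio would be.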
Tacotron 2 became the template for neural TTS for years: a text-to-spectrogram model paired with a separate neural vocoder. Later work mostly kept that overall shape and replaced the slow, autoregressive WaveNet stage with faster, non-autoregressive vocoders such as HiFi-GAN.
Why business readers should care: Tacotron 2 showed that fully neural text-to-speech could come close to human naturalness in listener ratings, and the two-stage text-to-spectrogram-to-audio pipeline it popularized remained a common design in the voice systems that followed.