“HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” submitted to arXiv on October 12, 2020 by Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, introduced a neural vocoder: the component that converts a mel spectrogram into an audible waveform. The breakthrough was speed without a quality penalty. Earlier high-quality vocoders like WaveNet were autoregressive, generating one sample at a time, and were painfully slow as a result, while faster GAN-based vocoders sounded worse.
The authors traced the quality gap to a specific failure: GAN vocoders were not modeling the periodic patterns that give speech audio its texture. HiFi-GAN’s generator uses a multi-receptive-field design, which combines convolutional branches that observe patterns of different lengths in parallel, and its discriminators include one built specifically to inspect the waveform's periodic structure. The resulting model generated 22.05 kHz audio 167.9 times faster than real time on a single V100 GPU at near-human quality as measured by mean opinion score, and a lightweight version ran 13.4 times faster than real time on a CPU.
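To make those speed figures concrete, here is a minimal Python sketch of the arithmetic. It assumes that "N times faster than real time" means the model produces N seconds of audio per second of compute; the function names are illustrative and do not come from the paper's code.

```python
# Sampling rate and speedup figures are taken from the text above.
SAMPLE_RATE_HZ = 22_050   # 22.05 kHz audio
GPU_SPEEDUP = 167.9       # full model on a single V100 GPU
CPU_SPEEDUP = 13.4        # lightweight model on a CPU


def samples_per_second(speedup: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Audio samples generated per wall-clock second at a given speedup.

    Assumption: speedup = seconds of audio produced per second of compute.
    """
    return speedup * sample_rate_hz


def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to generate a clip of the given duration."""
    return audio_seconds / speedup


if __name__ == "__main__":
    for name, speedup in [("GPU", GPU_SPEEDUP), ("CPU", CPU_SPEEDUP)]:
        rate = samples_per_second(speedup)
        latency_ms = synthesis_seconds(10.0, speedup) * 1000.0
        print(f"{name}: {rate:,.0f} samples/s, 10 s clip in {latency_ms:.1f} ms")
```

Under that assumption, the GPU figure works out to roughly 3.7 million samples per second, so a 10-second clip takes about 60 ms to generate on the GPU and about 0.75 s on the CPU.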
HiFi-GAN became a de facto standard vocoder across the field, replacing the slow WaveNet stage in Tacotron 2-style pipelines and powering many open-source and commercial text-to-speech and voice-cloning systems.
Why business readers should care: HiFi-GAN is the unglamorous piece that made high-quality neural speech cheap enough to ship. By turning spectrograms into audio faster than real time, even on a CPU, it removed the cost barrier between research-grade voices and products people actually use.