HiFi-GAN generated speech audio about 168x faster than real time

In the 2020 paper introducing HiFi-GAN, the authors reported that their neural vocoder generated 22.05 kHz audio 167.9 times faster than real time on a single V100 GPU, while scoring close to human quality on mean opinion score (MOS). A small-footprint variant ran 13.4 times faster than real time on a CPU, with quality comparable to an autoregressive counterpart. That speed is a large part of why the autoregressive WaveNet vocoder, which generates audio one sample at a time, became unnecessary for most production text-to-speech systems.
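To make the "times faster than real time" figures concrete, the arithmetic can be sketched as below. This is a rough illustration, not code from the paper: the function names and the 10-second clip length are assumptions chosen for the example, and only the 22.05 kHz sample rate and the 167.9x and 13.4x speedups come from the text above. At 167.9x, the vocoder produces roughly 3.7 million audio samples per wall-clock second.

```python
# Illustrative arithmetic only; the helper names and the clip length are hypothetical.
SAMPLE_RATE_HZ = 22_050  # 22.05 kHz, the sample rate stated in the text


def samples_per_second(speedup: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Audio samples generated per wall-clock second at a given speedup over real time."""
    return speedup * sample_rate_hz


def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of audio at a given speedup."""
    return audio_seconds / speedup


if __name__ == "__main__":
    clip_seconds = 10.0  # assumed clip length, for illustration
    for label, speedup in [("V100 GPU", 167.9), ("CPU, smaller variant", 13.4)]:
        rate = samples_per_second(speedup)
        wall = synthesis_seconds(clip_seconds, speedup)
        print(f"{label}: {rate:,.0f} samples/s, {clip_seconds:.0f} s clip in {wall:.3f} s")
```

By the same arithmetic, a model that only just keeps pace with playback has a speedup of 1.0 and produces 22,050 samples per second.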

Sources

Last verified June 7, 2026