“High Fidelity Neural Audio Compression,” submitted to arXiv on October 24, 2022 by Alexandre Defossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi at Meta AI, introduced EnCodec, a neural network that compresses audio into a compact stream of discrete tokens and reconstructs it with high quality. It uses a streaming convolutional encoder-decoder with a residual vector quantizer, trained end to end, and adds a loss balancer that simplifies the usual hyperparameter tuning. Lightweight Transformers can squeeze the representation further.
EnCodec works across speech, noisy speech, and music at both 24 kHz mono and 48 kHz stereo, outperforming prior codecs at comparable bitrates. Meta released the code openly.
Why business readers should care: beyond efficient streaming and storage, EnCodec became infrastructure for generative audio. By turning sound into a vocabulary of tokens, it let models like MusicGen treat music and speech the way language models treat text, which is the bridge from compression to generation.