“Neural Discrete Representation Learning,” posted to arXiv on November 2, 2017 by Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu of DeepMind, introduced the Vector Quantised-Variational AutoEncoder, or VQ-VAE. Where a standard variational autoencoder represents data with continuous latent variables, VQ-VAE learns a discrete codebook of embedding vectors and replaces each encoder output with its nearest codebook entry, so that every input is represented as a grid or sequence of codebook indices.
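The codebook lookup described above can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the paper's implementation: the codebook values, the encoder outputs, and the function names `squared_distance` and `quantize` are all hypothetical, and nearest-neighbour matching by squared Euclidean distance is assumed as the lookup rule.

```python
# Toy sketch of a nearest-neighbour codebook lookup in pure Python.
# All numbers and names here are hypothetical and chosen for illustration.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(encoder_outputs, codebook):
    """Return, for each encoder output, the index of its nearest codebook entry."""
    indices = []
    for z in encoder_outputs:
        distances = [squared_distance(z, e) for e in codebook]
        indices.append(distances.index(min(distances)))
    return indices

# Hypothetical codebook: K = 4 entries, each of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical continuous encoder outputs for three positions of one input.
encoder_outputs = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

codes = quantize(encoder_outputs, codebook)   # discrete symbols
quantized = [codebook[k] for k in codes]      # vectors passed on to the decoder

print(codes)  # [0, 3, 2]
```

The list `codes` is the discrete representation of the input; a decoder would receive the corresponding `quantized` vectors rather than the original continuous outputs.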
This discreteness brought two practical benefits. It sidestepped a common failure of VAEs called posterior collapse, where a powerful decoder learns to ignore the latent codes, and it produced compact, token-like representations that a separate autoregressive prior could then learn to generate; the paper used PixelCNN for images and WaveNet for audio. In effect, VQ-VAE compresses an image, video, or audio clip into a grid or sequence of discrete symbols, and a second model learns the distribution over those symbols. The paper reported high-quality generation of images, video, and speech, as well as speaker conversion and unsupervised learning of phoneme-like units from raw audio.
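To give a sense of how compact such a grid of symbols can be, the following back-of-the-envelope sketch compares the bits needed for raw pixels with the bits needed for the code indices. The sizes used (a 128×128 RGB image at 8 bits per channel, a 32×32 grid of codes, and a codebook of 512 entries) are illustrative assumptions, and the helper `bits_for_codes` is a hypothetical name.

```python
# Rough arithmetic sketch: raw pixel bits versus discrete code bits.
# The image size, grid size, and codebook size are illustrative assumptions.

def bits_for_codes(grid_height, grid_width, codebook_size):
    """Bits needed to store one codebook index per grid position."""
    # Smallest number of bits that can index codebook_size entries.
    bits_per_code = (codebook_size - 1).bit_length()
    return grid_height * grid_width * bits_per_code

raw_bits = 128 * 128 * 3 * 8              # 8-bit RGB image: 393216 bits
code_bits = bits_for_codes(32, 32, 512)   # 32 * 32 * 9 = 9216 bits
ratio = raw_bits / code_bits

print(raw_bits, code_bits, round(ratio, 1))  # 393216 9216 42.7
```

Under these assumptions the code grid is roughly 40 times smaller than the raw image, which is what makes it practical for a second model to learn a distribution over the symbols.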
VQ-VAE proved enormously influential because its discrete-token approach connected continuous media to the autoregressive and transformer techniques that worked so well on text. This lineage runs directly into systems such as the original DALL-E, which generated images as sequences of discrete tokens produced by a related discrete autoencoder, and, more loosely, into the compressed latent spaces used by later latent diffusion models. For a general reader, VQ-VAE is a key conceptual bridge: it let images and audio be treated, in part, like a language of discrete pieces.