HuBERT: Speech Representation Learning by Masked Prediction

“HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” submitted to arXiv on June 14, 2021 by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed of Facebook AI Research and Carnegie Mellon University, adapted the masked-prediction idea behind BERT to raw audio. Unlike text, speech has no built-in vocabulary of discrete units, so the method first runs an offline clustering step (k-means in the paper) to assign each short audio frame a discrete pseudo-label. It then masks spans of the encoded frame sequence and trains a model to predict the pseudo-labels of the masked frames.
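The two steps described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: it assumes toy 2-D frame features and a small hand-written k-means. The function names, the `start_prob` and `span` defaults, and the uniform logits standing in for a model's output are all illustrative assumptions.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(frames, k, iters=10, seed=0):
    """Toy k-means: return one cluster id (pseudo-label) per frame."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assignment step: each frame takes the id of its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(f, centroids[c]))
                  for f in frames]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sample_span_mask(num_frames, start_prob=0.08, span=10, seed=0):
    """Pick random start frames and mask `span` consecutive frames from each."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for t in range(num_frames):
        if rng.random() < start_prob:
            for j in range(t, min(t + span, num_frames)):
                mask[j] = True
    return mask


def masked_cross_entropy(logits, labels, mask):
    """Average cross-entropy over masked frames only; unmasked frames are skipped."""
    total, count = 0.0, 0
    for row, lab, is_masked in zip(logits, labels, mask):
        if not is_masked:
            continue
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[lab]
        count += 1
    return total / count if count else 0.0


# Toy data: 20 frames near (0, 0) followed by 20 frames near (5, 5).
data_rng = random.Random(1)
frames = [[data_rng.gauss(mu, 0.1), data_rng.gauss(mu, 0.1)]
          for mu in (0.0, 5.0) for _ in range(20)]

labels = kmeans_labels(frames, k=2)            # step 1: offline pseudo-labels
mask = sample_span_mask(len(frames))           # step 2a: choose spans to hide
uniform_logits = [[0.0, 0.0] for _ in frames]  # stand-in for a model's predictions
loss = masked_cross_entropy(uniform_logits, labels, mask)  # step 2b: masked loss
```

In a real system the logits would come from a neural network that sees the sequence with the masked frames hidden. The sketch only shows where the targets come from and which frames contribute to the loss.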

The process is iterative: after each round of training, the model’s own representations are clustered again to produce better targets for the next round. In this way HuBERT learns features that capture both acoustic and language-like structure. The approach matched or beat wav2vec 2.0 on the LibriSpeech and Libri-light speech recognition benchmarks and became a widely used backbone for downstream speech tasks.
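The refinement loop can be outlined as plain control flow. In this sketch, `cluster`, `train`, and `extract` are hypothetical callables supplied by the caller, and the stub versions below exist only to make the example runnable. The first round is assumed to start from simple hand-crafted acoustic features such as MFCCs.

```python
def iterate_targets(audio, initial_features, cluster, train, extract, rounds=2):
    """Alternate clustering and masked-prediction training for `rounds` rounds."""
    features = initial_features
    model = None
    for _ in range(rounds):
        labels = cluster(features)        # pseudo-labels from the current features
        model = train(audio, labels)      # masked prediction against those labels
        features = extract(model, audio)  # learned features feed the next round
    return model, features


# Hypothetical stubs that only record the order of the calls.
calls = []


def cluster(features):
    calls.append("cluster")
    return [int(x > 0.5) for x in features]


def train(audio, labels):
    calls.append("train")
    return {"targets": labels}


def extract(model, audio):
    calls.append("extract")
    return [float(t) for t in model["targets"]]


model, features = iterate_targets(
    audio=[0.1, 0.9, 0.4],
    initial_features=[0.2, 0.8, 0.6],
    cluster=cluster,
    train=train,
    extract=extract,
)
```

Passing the three stages in as functions keeps the loop independent of any particular clustering or training library.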

Why business readers should care: HuBERT is one of the foundation models that power modern speech systems, from transcription to voice interfaces. Because its training recipe is self-supervised, organizations can adapt a strong pretrained audio model to their own domain with relatively little labeled data.

Last verified June 7, 2026