data2vec: A General Self-Supervised Framework for Speech, Vision and Language

“data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language,” submitted to arXiv on February 7, 2022 by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli at Meta AI, proposed a single self-supervised learning objective for three very different kinds of data. Instead of predicting modality-specific targets such as pixels, words, or speech units, data2vec predicts contextualized latent representations produced by a teacher copy of the model itself. The teacher, whose weights are an exponential moving average of the student's, encodes the full input; the student sees a masked version and learns to reproduce the teacher's representations at the masked positions, a form of self-distillation.
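The teacher-student loop described above can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the authors' implementation: the two-layer `encode` function stands in for a Transformer, masked positions are simply zeroed rather than replaced by a learned mask embedding, targets are not normalized, the decay `tau` is fixed, and the gradient step that would update the student is omitted. All function and parameter names are hypothetical.

```python
import numpy as np

def encode(params, x):
    """Toy stand-in for a Transformer: returns the output of every layer.
    Adding the sequence mean gives each position some context (an assumption)."""
    ctx = x.mean(axis=0, keepdims=True)
    h1 = np.tanh((x + ctx) @ params["w1"])
    h2 = np.tanh(h1 @ params["w2"])
    return [h1, h2]

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 regression loss, averaged over all elements."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()

def data2vec_style_loss(student, teacher, x, mask, top_k=2):
    # Teacher encodes the unmasked input; targets average its top-k layer outputs.
    targets = np.mean(encode(teacher, x)[-top_k:], axis=0)
    # Student encodes the masked input (masked time steps zeroed in this toy).
    x_masked = np.where(mask[:, None], 0.0, x)
    pred = encode(student, x_masked)[-1]
    # The loss is computed only at the masked positions.
    return smooth_l1(pred[mask], targets[mask])

def ema_update(teacher, student, tau):
    """Move each teacher weight toward the student: t <- tau * t + (1 - tau) * s."""
    return {k: tau * teacher[k] + (1.0 - tau) * student[k] for k in teacher}

# Toy usage: 6 time steps, 8 features, 3 masked positions.
rng = np.random.default_rng(0)
d, T = 8, 6
student = {"w1": rng.normal(size=(d, d)), "w2": rng.normal(size=(d, d))}
teacher = {k: v + 0.1 * rng.normal(size=v.shape) for k, v in student.items()}
x = rng.normal(size=(T, d))
mask = np.array([True, False, True, False, False, True])

loss = data2vec_style_loss(student, teacher, x, mask)
teacher = ema_update(teacher, student, tau=0.999)
```

Nothing in the loss refers to pixels, words, or audio: the targets are the model's own latent representations, so in this sketch only the input `x` and the masking pattern would change from one modality to another.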

Despite using one unified objective, the approach reached competitive or state-of-the-art results on speech recognition (LibriSpeech), image classification (ImageNet), and language understanding (GLUE). The model still relies on modality-specific input encoders and masking strategies, but the paper demonstrated that the core learning objective need not be reinvented for each modality.

Why business readers should care: data2vec points toward simpler, more general AI systems where a single training recipe handles many data types. For organizations, that promises less bespoke engineering per problem and shared foundations across audio, vision, and text pipelines.

Last verified June 7, 2026