“data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language,” submitted to arXiv on February 7, 2022 by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli at Meta AI, proposed using the same self-supervised method for three very different kinds of data. Instead of predicting modality-specific targets such as pixels, words, or audio units, data2vec trains a student network on a masked version of the input to predict the contextualized latent representations that a teacher copy of the model produces from the full, unmasked input. The teacher's weights are an exponential moving average of the student's, so the model effectively learns from itself through self-distillation.
Despite using one unified objective, the approach reached competitive or state-of-the-art results across speech recognition, image classification, and language understanding. The result suggests that the core recipe of self-supervised learning need not be reinvented for each modality.
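For technically minded readers, the training signal described above can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the one-layer `encode` function, the zero `mask_value`, the plain mean-squared-error loss, and every function and variable name are assumptions chosen for brevity, and the paper's actual architecture, loss, and target construction are not reproduced here. The sketch only shows the three moving parts: a student that sees masked input, a teacher that sees the full input, and teacher weights that trail the student as a moving average.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(weights, x):
    """Toy stand-in for a Transformer encoder (hypothetical, for illustration).

    Each position is mixed with the sequence mean, so every output
    depends on the whole input, i.e. it is 'contextualized'.
    """
    context = x.mean(axis=0, keepdims=True)
    return np.tanh((x + context) @ weights)  # shape: (seq_len, dim)

def ema_update(teacher_w, student_w, tau=0.999):
    """Move the teacher weights a small step toward the student weights."""
    return tau * teacher_w + (1.0 - tau) * student_w

def masked_regression_loss(student_w, teacher_w, x, mask, mask_value=0.0):
    """Mean squared error between student and teacher outputs at masked positions.

    Plain MSE is a simplification used in this sketch.
    """
    x_masked = np.where(mask[:, None], mask_value, x)  # student sees masked input
    targets = encode(teacher_w, x)                     # teacher sees the full input
    preds = encode(student_w, x_masked)
    diff = preds[mask] - targets[mask]                 # compare masked positions only
    return float(np.mean(diff ** 2))

# A tiny "sequence": 8 positions, 4 features each. For speech, images, or
# text these rows would be audio frames, image patches, or token embeddings.
seq_len, dim = 8, 4
x = rng.normal(size=(seq_len, dim))
mask = np.zeros(seq_len, dtype=bool)
mask[[1, 4, 5]] = True

student_w = rng.normal(size=(dim, dim))
teacher_w = student_w.copy()  # the teacher starts as a copy of the student

loss = masked_regression_loss(student_w, teacher_w, x, mask)

# Pretend a gradient step changed the student; the teacher then follows slowly.
updated_student_w = student_w + 0.1
new_teacher_w = ema_update(teacher_w, updated_student_w, tau=0.9)
```

Nothing in the sketch depends on what the rows of `x` represent, which is the point of the method: only the step that turns raw data into a sequence of vectors is modality-specific, while the prediction target and the loss stay the same.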
Why business readers should care: data2vec points toward simpler, more general AI systems where a single training recipe handles many data types. For organizations, that promises less bespoke engineering per problem and shared foundations across audio, vision, and text pipelines.