DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa, introduced in “DeBERTa: Decoding-enhanced BERT with Disentangled Attention” by Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen at Microsoft (submitted June 5, 2020), improves on BERT and RoBERTa with two architectural changes. The first is disentangled attention. As the authors describe it, “each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively.” Keeping content and position separate lets the model score how two words relate based on both what they are and how far apart they sit, rather than on a single vector in which the two signals are already summed.
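The idea can be made concrete with a minimal single-head NumPy sketch. It assumes the three-term score described in the paper (content-to-content, content-to-position, and position-to-content, scaled by the square root of three times the hidden size) and relative distances clipped to a window of size k. The function names, the dictionary of weight matrices, and the random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relative_buckets(seq_len, k):
    """Map every (i, j) pair to a relative-position bucket in [0, 2k).

    Distances i - j are shifted by k and clipped, so far-away tokens
    share the outermost buckets.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)

def disentangled_scores(H, P, W, k):
    """Unnormalized attention scores for one head.

    H: (n, d) content vectors, P: (2k, d) relative-position embeddings,
    W: dict of (d, d) projections (hypothetical names q_c, k_c, q_r, k_r).
    """
    n, d = H.shape
    Qc, Kc = H @ W["q_c"], H @ W["k_c"]   # content projections
    Qr, Kr = P @ W["q_r"], P @ W["k_r"]   # position projections
    delta = relative_buckets(n, k)

    c2c = Qc @ Kc.T                                        # content -> content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)     # content -> position
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T   # position -> content
    return (c2c + c2p + p2c) / np.sqrt(3 * d)

def disentangled_attention(H, P, W, k):
    """Softmax the scores row-wise and mix the content values."""
    scores = disentangled_scores(H, P, W, k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ (H @ W["v_c"]), weights

# Toy usage with random data (all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
n, d, k = 5, 8, 2
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
W = {name: rng.normal(size=(d, d)) * 0.1
     for name in ("q_c", "k_c", "v_c", "q_r", "k_r")}
out, weights = disentangled_attention(H, P, W, k)
print(out.shape)  # (5, 8)
```

Note that the position vectors enter only through the score terms; the values being mixed are projections of content alone, which is the sense in which the two representations stay disentangled in this sketch.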

The second change is an enhanced mask decoder, which incorporates “absolute positions in the decoding layer to predict the masked tokens in model pre-training.” Together these changes produced substantial gains: a DeBERTa model trained on half the training data outperformed RoBERTa-Large on MNLI (91.1% vs. 90.2%), SQuAD v2.0 (90.7% vs. 88.4%), and RACE (86.8% vs. 83.2%), and a larger 1.5-billion-parameter DeBERTa became the first single model to surpass the human baseline on the demanding SuperGLUE benchmark, with a macro-average score of 89.9 against the human 89.8.

DeBERTa matters for two reasons. It marked a symbolic milestone, with a machine exceeding the human reference score on a broad suite of language understanding tasks, and its disentangled treatment of position influenced later thinking about how transformers encode word order.
