ALBERT: A Lite BERT

ALBERT, from “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations” by Zhenzhong Lan and colleagues at Google Research (submitted September 26, 2019), set out to make BERT-style models smaller and cheaper to train without giving up accuracy. As models grew, memory limits and training cost became real obstacles, and ALBERT addresses both with two parameter-reduction techniques.

The first is factorized embedding parameterization, which decomposes the large vocabulary embedding matrix into two smaller matrices so that the embedding size no longer has to match the hidden layer size. The second is cross-layer parameter sharing, which reuses the same weights across transformer layers instead of learning separate parameters for each, so the parameter count stops growing with depth. ALBERT also replaces BERT’s next-sentence-prediction task with “a self-supervised loss that focuses on modeling inter-sentence coherence,” called sentence-order prediction, in which the model decides whether two consecutive text segments appear in their original order or have been swapped. Despite having fewer parameters than BERT-large, the best ALBERT model reached state-of-the-art results on GLUE, RACE, and SQuAD.
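To make the effect of these ideas concrete, here is a minimal Python sketch that counts parameters with and without each technique and builds a sentence-order training pair. It is an illustration under stated assumptions, not ALBERT's actual implementation: the function names are hypothetical, the sizes (vocabulary 30,000, hidden size 768, embedding size 128, 12 layers) are illustrative, and the per-layer count uses the rough 12·H² estimate for a standard transformer layer, ignoring biases and layer normalization.

```python
from typing import Optional, Tuple


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count word-embedding parameters (hypothetical helper).

    Without factorization the table is V x H. With factorization it is a
    V x E lookup table followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


def approx_layer_params(hidden_size: int) -> int:
    """Rough size of one transformer layer (assumption: 4*H^2 for attention
    plus 8*H^2 for a feed-forward block of width 4*H; biases and layer
    normalization are ignored)."""
    return 12 * hidden_size * hidden_size


def encoder_params(num_layers: int, params_per_layer: int,
                   share_across_layers: bool) -> int:
    """With cross-layer sharing, one set of layer weights is reused at every
    depth, so the total no longer scales with the number of layers."""
    return params_per_layer if share_across_layers else num_layers * params_per_layer


def sentence_order_example(first: str, second: str,
                           swap: bool) -> Tuple[str, str, int]:
    """Build one sentence-order-prediction pair from two consecutive segments.

    Label 1 means the segments are in their original order, 0 means swapped
    (the label encoding is an assumption made for this sketch).
    """
    return (second, first, 0) if swap else (first, second, 1)


# Illustrative sizes, chosen for this sketch.
V, H, E, LAYERS = 30_000, 768, 128, 12

print("embeddings, unfactorized:", embedding_params(V, H))
print("embeddings, factorized:  ", embedding_params(V, H, E))
print("encoder, separate layers:", encoder_params(LAYERS, approx_layer_params(H), False))
print("encoder, shared layers:  ", encoder_params(LAYERS, approx_layer_params(H), True))
print(sentence_order_example("It rained all night.", "The streets flooded.", swap=True))
```

With these sizes, factorization reduces the embedding table from about 23 million parameters to about 4 million, and sharing reduces the estimated encoder weights by a factor equal to the number of layers. Sharing shrinks the number of stored parameters only; every layer is still executed in the forward pass.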

ALBERT matters because it was an early, influential demonstration that a large language model can be shrunk substantially through smarter parameter design rather than brute-force scaling. That mindset of doing more with fewer parameters runs through much of the later work on efficient models.
