DistilBERT: A Distilled Version of BERT

DistilBERT, introduced in “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter” by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf at Hugging Face (submitted October 2, 2019), applies knowledge distillation to make BERT practical for everyday deployment. In distillation, a smaller student model is trained to imitate a larger teacher, learning not just the right answer but the full probability distribution the teacher assigns across all possible outputs.
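
To make the difference between a right answer and a full set of predictions concrete, here is a minimal Python sketch. It assumes the temperature-scaled softmax commonly used in knowledge distillation; the logits, the four-word vocabulary, and the temperature values are made up for illustration and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits over a toy 4-word vocabulary.
teacher_logits = [4.0, 2.5, 0.5, -1.0]

# A hard label keeps only the index of the top prediction.
hard_label = teacher_logits.index(max(teacher_logits))

# Soft targets keep the teacher's relative preferences among all words.
soft_t1 = softmax(teacher_logits, temperature=1.0)
soft_t2 = softmax(teacher_logits, temperature=2.0)

print("hard label:", hard_label)
print("soft targets (T=1):", [round(p, 3) for p in soft_t1])
print("soft targets (T=2):", [round(p, 3) for p in soft_t2])
```

Raising the temperature moves probability mass from the top word toward the less likely ones, so the student sees more of how the teacher ranks the alternatives.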

The authors report that the technique lets them “reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster.” They achieve this with “a triple loss combining language modeling, distillation and cosine-distance losses” applied during pretraining, so the compact model learns general-purpose language representations from the teacher rather than being distilled only for a single downstream task.
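
The three terms of the triple loss can be sketched for a single masked token as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the equal weights, the temperature of 2.0, and the toy inputs are all hypothetical, and the cosine term is written as one minus the cosine similarity between student and teacher hidden vectors.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy of the student against the teacher's soft targets."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

def mlm_loss(student_logits, true_token_id):
    """Language modeling term: negative log-likelihood of the true token."""
    return -math.log(softmax(student_logits)[true_token_id])

def cosine_loss(student_hidden, teacher_hidden):
    """Cosine-distance term: 1 - cosine similarity of the hidden vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, true_token_id,
                student_hidden, teacher_hidden,
                w_distill=1.0, w_mlm=1.0, w_cos=1.0, temperature=2.0):
    """Weighted sum of the three terms; weights here are placeholders."""
    return (w_distill * distillation_loss(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, true_token_id)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

# Toy inputs (made up): logits over a 4-word vocabulary, 3-dim hidden states.
teacher_logits = [4.0, 2.5, 0.5, -1.0]
student_logits = [3.0, 2.0, 1.0, -0.5]
teacher_hidden = [0.2, -0.4, 0.9]
student_hidden = [0.1, -0.3, 1.0]

loss = triple_loss(student_logits, teacher_logits, 0,
                   student_hidden, teacher_hidden)
print("triple loss:", round(loss, 4))
```

In a real training loop each term would be averaged over all masked positions in a batch, and the weights would be tuned rather than left equal.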

DistilBERT matters because it brought transformer-quality language understanding to settings with tight compute and latency budgets, including on-device and mobile use. It became one of the most downloaded models on Hugging Face and a popular default when a full-size BERT is too slow or expensive, showing that smaller models can be good enough for many real-world applications.

Sources

Last verified June 7, 2026