E5: Text Embeddings by Weakly-Supervised Contrastive Pre-training

E5, published by Liang Wang and colleagues at Microsoft in December 2022, is a family of general-purpose text embeddings trained with contrastive learning. The name stands for EmbEddings from bidirEctional Encoder rEpresentations. Rather than relying on expensive human-labeled data, the team assembled CCPairs, a large curated dataset of text pairs drawn from web sources, and used it as weak supervision to teach the model which texts belong together.
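To make "teaching the model which texts belong together" concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss with in-batch negatives, the general technique behind this kind of training. It is an illustration under stated assumptions, not the authors' code: the function names, the toy 2-D vectors, and the temperature value are all hypothetical choices made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch serves as a negative. The temperature is an
    illustrative value, not a setting taken from the paper.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true pair.
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy 2-D "embeddings" (hypothetical data).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each passage is close to its own query
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each passage is close to the wrong query

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

The loss is small when each text sits closest to its paired text and large when the pairs are mismatched, which is the signal that pushes an encoder to place related texts near each other.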

Two results made E5 notable. First, it was the first model to beat the long-standing BM25 keyword baseline on the BEIR retrieval benchmark in a zero-shot setting, that is, without using any labeled data, which showed that it generalized to new domains without task-specific training. Second, when fine-tuned, it topped the MTEB benchmark, outperforming embedding models with 40 times more parameters and showing that careful data and training could outweigh raw size. The evaluation covered 56 datasets from BEIR and MTEB, spanning tasks such as retrieval, clustering, and classification.

E5 became one of the most widely used open embedding models and a frequent backbone for retrieval-augmented systems. For a business, it is a reminder that a smaller, well-trained open model can rival or beat much larger or proprietary options for search, often at a fraction of the serving cost.
