Running BERT over every query-document pair gives excellent search relevance but is far too slow to apply to a large collection at query time. In 2020 Omar Khattab and Matei Zaharia at Stanford introduced ColBERT to resolve this tension between accuracy and speed.
ColBERT’s idea is “late interaction.” Rather than mashing the query and document together through BERT (accurate but expensive) or collapsing each into a single vector and comparing them once (fast but lossy), ColBERT encodes the query and the document separately, each into its own bag of per-token vectors. It then computes relevance with a cheap operation: for each query token, it finds the best-matching document token, and it sums those best-match scores. Because document representations do not depend on the query, they can be computed and indexed offline. At search time only the short query has to be encoded before the lightweight matching step runs, so ColBERT preserves much of the fine-grained, term-level matching that makes BERT strong while being orders of magnitude faster.
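The matching step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. The function names, the dictionary standing in for an offline index, and the tiny two-dimensional vectors are all made up for the example, and the dot product is assumed as the similarity measure between token vectors.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors (assumed similarity measure)."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """For each query token vector, take its best match among the document
    token vectors, then sum those best-match scores."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical token embeddings, invented for illustration only.
# In a real system an encoder would produce these, and the document
# vectors would be computed and stored ahead of time.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # good match for both query tokens
doc_b = [[1.0, 0.0], [0.5, 0.0]]              # matches only the first query token

# Stand-in for the offline index: document name -> precomputed token vectors.
index: Dict[str, List[Vector]] = {"doc_a": doc_a, "doc_b": doc_b}

# At query time, only the scoring runs over the precomputed vectors.
ranking = sorted(
    index,
    key=lambda name: late_interaction_score(query, index[name]),
    reverse=True,
)
```

Here `doc_a` scores 2.0 and `doc_b` scores 1.0, so `doc_a` ranks first: it has a strong match for both query tokens, while `doc_b` has nothing resembling the second one. A single-vector comparison would have to average that distinction away, which is the loss late interaction is designed to avoid.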
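The late-interaction approach occupies a useful middle ground between single-vector dense retrieval and full cross-attention re-ranking. Its main cost is storage, since the index holds a vector for every document token rather than one per document; the follow-up ColBERTv2 reduced that footprint by compressing the stored vectors. It remains a widely cited and used method for high-quality passage search.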
For a business reader, ColBERT illustrates a central engineering trade-off in modern search: when most of the cost can be paid in advance, the system can serve deep, language-model-quality relevance at the speed users expect.