ColBERTv2, published by Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia in December 2021, refines the late-interaction approach to neural search. Most dense retrievers compress an entire document into one vector; ColBERT-style models instead keep a separate vector for every token and compare a query’s token vectors against a document’s, scoring each query token by its best-matching document token and summing those scores. This captures fine-grained matches that a single summary vector misses. The catch was storage: those per-token representations took roughly ten times more space than single-vector indexes.
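The token-level comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `maxsim_score` and the toy vectors are assumptions made for the example, and real systems would take the token embeddings from a trained encoder rather than write them by hand.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction relevance score (hypothetical helper for illustration).

    query_vecs: (num_query_tokens, dim) array of query token embeddings.
    doc_vecs:   (num_doc_tokens, dim) array of document token embeddings.
    """
    # Normalize each token vector so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

    # Similarity of every query token to every document token.
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)

    # For each query token keep only its best-matching document token,
    # then sum those maxima over the query tokens.
    return float(sim.max(axis=1).sum())

# Toy example: two query tokens, three document tokens, 2-dimensional embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
score = maxsim_score(query, doc)
```

Because every query token looks for its own best match, a document can score well by matching different query terms in different places, which a single pooled vector cannot express. The cost is that every document token vector has to be stored, which is the storage problem the next paragraph takes up.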
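ColBERTv2 addressed this with two ideas. Aggressive residual compression, which stores each token vector as the ID of a nearby centroid plus a low-bit quantized residual, shrinks the multi-vector index by 6 to 10 times without losing quality. A denoised supervision strategy, which combines distillation from a cross-encoder with hard-negative mining, improves training. The result is a retriever that reaches state-of-the-art quality both inside and outside its training domain while being cheap enough to store at scale.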
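The idea behind residual compression can be illustrated with a simplified sketch: each token vector is replaced by the index of its nearest centroid plus a coarsely quantized version of the difference (the residual). The code below is an assumption-laden toy, not the ColBERTv2 codebase: the names `compress` and `decompress` are hypothetical, the centroids are simply passed in (in practice they would come from clustering a sample of token embeddings), nearest-centroid lookup uses Euclidean distance, bucket boundaries are taken from quantiles of the residual values, and the codes are kept as one byte each rather than bit-packed.

```python
import numpy as np

def compress(vecs: np.ndarray, centroids: np.ndarray, bits: int = 2):
    """Encode vectors as (centroid id, low-bit residual codes). Illustrative only."""
    # Assign every vector to its nearest centroid (squared Euclidean distance).
    dists = ((vecs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    ids = dists.argmin(axis=1)

    # The residual is what the centroid fails to explain.
    residuals = vecs - centroids[ids]

    # Split the residual values into 2**bits buckets using quantiles (assumption).
    levels = 2 ** bits
    edges = np.quantile(residuals, np.linspace(0.0, 1.0, levels + 1))

    # Each bucket is represented by its midpoint when decoding.
    values = (edges[:-1] + edges[1:]) / 2

    # Per-dimension bucket index in [0, levels - 1].
    codes = np.searchsorted(edges[1:-1], residuals).astype(np.uint8)
    return ids, codes, values

def decompress(ids, codes, values, centroids):
    """Approximate the original vectors from centroid ids and residual codes."""
    return centroids[ids] + values[codes]

# Toy data: 50 random 8-dimensional vectors, 4 of them reused as centroids.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(50, 8))
centroids = vecs[:4].copy()

ids, codes, values = compress(vecs, centroids, bits=2)
approx = decompress(ids, codes, values, centroids)
```

The storage arithmetic is what matters here. With 128-dimensional embeddings, 2 bits per dimension is 32 bytes of residual codes once bit-packed, plus a few bytes for the centroid index, compared with 256 bytes for the same vector stored at 16-bit precision. That ratio is the source of the several-fold index reduction described above.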
Late interaction sits between fast-but-coarse single-vector search and slow-but-precise cross-encoder reranking, offering much of the accuracy of the latter at closer to the cost of the former. For organizations building search or retrieval-augmented systems over large document collections, ColBERTv2 made a high-accuracy retrieval method genuinely deployable rather than a research curiosity.