Dense Passage Retrieval for Open-Domain Question Answering

For decades, finding relevant documents meant matching keywords: methods like TF-IDF and BM25 score documents by which query terms appear in them and how often. This works well when the query and the answer share words, but it fails when they use different vocabulary for the same idea. In 2020 Vladimir Karpukhin and colleagues at Facebook AI published Dense Passage Retrieval (DPR), demonstrating that a retriever learned from question-and-passage pairs could outperform a strong BM25 baseline.

DPR uses a dual-encoder design: one BERT-based encoder turns the question into a vector, another turns each passage into a vector, and relevance is measured by the dot product of the two vectors. Because the passage vectors can be computed once and stored in advance, retrieval at query time reduces to a fast nearest-neighbor lookup over the stored vectors. Trained on question-and-passage pairs, DPR outperformed the strong BM25 baseline by 9 to 19 percentage points in top-20 retrieval accuracy across several open-domain question-answering benchmarks, and it improved end-to-end answer accuracy as well.
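The query-time step can be illustrated with a minimal sketch. It assumes the encoders have already run, so the vectors below are hypothetical hand-made values rather than real BERT outputs, and the function names `dot` and `retrieve` are illustrative only. It scores every passage by exhaustive inner product; a real system would typically use a dedicated similarity-search index instead.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors, used as the relevance score."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(question_vec: Vector,
             passage_vecs: List[Vector],
             k: int) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages."""
    scored = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical precomputed passage vectors (assumption: stand-ins for the
# passage encoder's output, computed once and stored ahead of time).
passage_vecs = [
    [0.9, 0.1, 0.0],  # passage 0
    [0.1, 0.8, 0.2],  # passage 1
    [0.0, 0.1, 0.9],  # passage 2
]

# Hypothetical question vector (assumption: stand-in for the question
# encoder's output, computed at query time).
question_vec = [0.2, 0.9, 0.1]

top = retrieve(question_vec, passage_vecs, k=2)
for index, score in top:
    print(f"passage {index}: score {score:.2f}")
```

Only the question has to be encoded when a query arrives; the passage side is a lookup over vectors that already exist, which is what makes the dual-encoder design practical at scale.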

DPR helped move information retrieval from sparse keyword matching toward dense semantic search, and its dual-encoder pattern is now standard in the retrieval stage of retrieval-augmented generation systems that feed context to large language models.

For a general reader, this paper is a turning point in how machines find things: search engines and AI assistants increasingly match meaning rather than exact words, which is why a question phrased in your own terms can still surface the right passage.
