Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)

“Precise Zero-Shot Dense Retrieval without Relevance Labels,” submitted to arXiv on December 20, 2022 by Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan, introduced HyDE (Hypothetical Document Embeddings), a counterintuitive way to retrieve relevant documents without any labeled training data.

Dense retrieval normally embeds the user’s query and searches for documents with nearby embeddings. The problem is that a short question and a full answer document often look quite different in embedding space. HyDE sidesteps this in two steps. First, an instruction-following model such as InstructGPT is asked to write a hypothetical document that answers the query. This document may contain made-up specifics, but it captures the shape and vocabulary of a real answer. Second, an unsupervised contrastive encoder embeds that hypothetical document, and the resulting vector is used to search the real corpus for documents with similar embeddings. The authors argue that the encoder’s dense bottleneck acts as a filter, washing out the invented details while keeping the relevant pattern.
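
The two steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: generate_hypothetical_document is a hypothetical stub standing in for the call to an instruction-following model, embed is a simple bag-of-words stand-in for a contrastive encoder such as Contriever, and the corpus and canned answer are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a unit-length bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def generate_hypothetical_document(query):
    """Hypothetical stub for the LLM call; a real system would prompt a model here."""
    canned = {
        "how long does it take to remove a wisdom tooth": (
            "Removing a wisdom tooth usually takes about 45 minutes under "
            "local anesthesia, though recovery lasts days."
        ),
    }
    # Fall back to the raw query when no canned answer exists.
    return canned.get(query, query)


def hyde_search(query, corpus, generate=generate_hypothetical_document, k=1):
    # Step 1: write a hypothetical answer (its specifics may be invented).
    hypothetical = generate(query)
    # Step 2: embed the hypothetical document, not the query, and rank the corpus.
    query_vector = embed(hypothetical)
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


# Invented three-document corpus for illustration only.
CORPUS = [
    "Extraction of a wisdom tooth usually takes 20 to 40 minutes under local anesthesia.",
    "Contriever is an unsupervised dense retriever trained with contrastive learning.",
    "Swahili is spoken widely across East Africa.",
]
QUERY = "how long does it take to remove a wisdom tooth"

print(hyde_search(QUERY, CORPUS, k=1))
```

Note that the canned answer's "45 minutes" does not match the real document's "20 to 40 minutes", yet the real document still ranks first because the surrounding vocabulary overlaps. That mirrors, in miniature, the idea that the invented detail matters less than the overall pattern. In a real system the stub would be replaced by a model call and embed by a trained encoder with a vector index.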

HyDE substantially outperformed the baseline unsupervised retriever Contriever and reached performance comparable to fine-tuned retrievers, across web search, question answering, and fact verification, and in multiple languages including Swahili, Korean, and Japanese.

Why business readers should care: HyDE is a clever, training-free way to improve the retrieval half of a RAG system. Since retrieval quality sets a ceiling on how well a grounded AI can answer, techniques like this directly affect how accurate a document-search assistant feels.
