MS MARCO was introduced in “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset,” posted to arXiv on November 28, 2016 by Tri Nguyen and colleagues at Microsoft. The name stands for Microsoft MAchine Reading COmprehension, and what set the dataset apart from earlier reading-comprehension sets was that its questions came from real people: more than a million anonymized questions sampled from Bing’s search query logs, each paired with a human-generated answer.
The dataset is built from genuine search behavior. It contains 1,010,916 questions, each with a human-generated answer, plus 182,669 answers that were fully rewritten by humans, all grounded in 8,841,823 passages extracted from 3,563,535 web documents that Bing retrieved. Reflecting real search, a question may have multiple valid answers or none at all. The authors framed three tasks of varying difficulty: predicting whether a question is answerable from the retrieved passages and, if so, extracting and synthesizing the answer; generating a well-formed natural-language answer; and ranking the retrieved passages for a question.
MS MARCO became one of the most widely used benchmarks for both question answering and information retrieval, and its passage-ranking track in particular helped drive the rise of the neural retrieval models that now underpin search and retrieval-augmented generation. For a general reader, it is a clear case of turning the exhaust of a real product, its search logs, into training data that shaped how machines learn to find and phrase answers.