BookCorpus and Aligning Books and Movies

“Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books” was submitted to arXiv on June 22, 2015 by Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. The paper’s research goal was to align passages in books with scenes in their movie adaptations, but its lasting legacy is a byproduct: the large corpus of book text the authors assembled to train their sentence embeddings, which became known as BookCorpus.

BookCorpus was built from free e-books and went on to become one of the most influential text datasets in the field. It was part of the training data for BERT and the original GPT, among many others, because long-form book prose offered coherent, well-edited text that web pages often lack. For years it was cited far more as training data than for the movie-alignment task it was created to serve.

The dataset also became a case study in provenance problems. The books were collected from the self-publishing site Smashwords, the original release was later withdrawn, and researchers found duplicated titles and unclear licensing. These issues surfaced only after the corpus was already baked into widely used models. For a general reader, BookCorpus shows how a dataset gathered for one narrow experiment can become foundational infrastructure, carrying its unexamined assumptions and rights questions into systems used by millions.
