Brown Corpus

The Brown Corpus, formally “A Standard Corpus of Present-Day Edited American English, for use with Digital Computers,” was compiled by W. Nelson Francis and Henry Kučera at Brown University's Department of Linguistics and first released in 1964, with revised editions in 1971 and 1979. It was the first carefully balanced, million-word collection of English text prepared specifically for computer analysis, and a copy is available from the Internet Archive.

The corpus consists of 500 text samples of roughly 2,000 words each, totaling just over one million words, drawn entirely from edited prose published in the United States during 1961. Francis and Kučera spread the samples across 15 genre categories, including press reporting, editorials, government documents, learned writing, and fiction, so that the corpus would represent a cross-section of the written language rather than any single source.

That deliberate, documented sampling design was its lasting innovation. It let researchers compute word frequencies, study grammar, and later add part-of-speech tags to every word, all on a shared, reproducible body of text. The Brown Corpus became the template for corpus linguistics and a direct ancestor of later annotated resources such as the Penn Treebank.
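As a rough illustration of the word-frequency counting described above, here is a minimal Python sketch. It makes several assumptions: the two short samples are invented stand-ins, not text from the Brown Corpus; the function name word_frequencies is hypothetical; and the tokenizer is a deliberately simple regular expression, not the procedure Francis and Kučera used.

```python
from collections import Counter
import re


def word_frequencies(samples):
    """Count lowercase word tokens across every sample, regardless of genre."""
    counts = Counter()
    for text in samples.values():
        # Simplistic tokenizer (an assumption): runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Hypothetical stand-in samples keyed by genre; NOT taken from the Brown Corpus.
samples = {
    "press": "The council approved the budget. The mayor praised the vote.",
    "fiction": "She opened the door and the cold air rushed in.",
}

freqs = word_frequencies(samples)
print(freqs.most_common(1))  # [('the', 6)]
```

Because every researcher counts over the same fixed set of samples, two people running the same procedure should get the same table, which is what made results on a shared corpus comparable.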

Why business readers should care: the Brown Corpus established a principle that still governs AI. The quality and representativeness of the data you collect, and how carefully you document it, shape everything a model trained on it can learn.
