“The Mathematics of Statistical Machine Translation: Parameter Estimation,” by Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer of IBM’s Thomas J. Watson Research Center, appeared in the journal Computational Linguistics in 1993 (volume 19, number 2, pages 263-311). It is widely regarded as the foundational paper of statistical machine translation, the approach that displaced hand-written linguistic rules and dominated the field for roughly two decades until neural networks took over.
The central move was to treat translation as a problem of probability rather than grammar. Borrowing the noisy-channel idea from speech recognition, the authors modeled the probability that a French sentence is the translation of an English one, combined it with a language model giving the probability of the English sentence itself, and then searched for the English sentence that maximized the product of the two. This reframing meant a system could be trained automatically from large collections of human-translated text instead of being programmed by linguists - the same data-over-rules philosophy that would later define modern AI.
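In symbols, the noisy-channel search can be written as an application of Bayes' rule. The notation here is chosen for illustration: e stands for a candidate English sentence and f for the observed French sentence.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

The denominator P(f) does not depend on e, so it drops out of the maximization. What remains is a language model P(e), which scores how plausible the English sentence is on its own, and a translation model P(f | e), whose parameters are what the paper sets out to estimate.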
The paper introduced a series of five increasingly detailed models, since known as the IBM Models. Each adds structure to the notion of word alignment - which source words gave rise to which target words - including how many target words a single source word tends to produce (its fertility) and how words get reordered (distortion). The authors estimated these models’ parameters from sentence pairs using the expectation-maximization algorithm, learning the alignments without anyone labeling them by hand. The IBM Models became standard tools and lived on inside later phrase-based systems and toolkits like Moses.
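As a rough illustration of how expectation-maximization can learn word translations from unlabeled sentence pairs, here is a minimal sketch in the spirit of the simplest of the five models (Model 1), which ignores word order and fertility. The function name `train_model1`, the `<null>` token, and the three-sentence toy corpus are assumptions made up for this example, not taken from the paper.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical token: lets a French word align to no English word


def train_model1(pairs, iterations=10):
    """Estimate t[(f, e)], an approximation of P(f | e), with EM.

    pairs: list of (english_tokens, french_tokens) sentence pairs.
    """
    french_vocab = {f for _, fs in pairs for f in fs}
    # Start from a uniform distribution over the French vocabulary.
    t = defaultdict(lambda: 1.0 / len(french_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links from e

        # E-step: share each French word among the English words in its
        # sentence, in proportion to the current translation probabilities.
        for es, fs in pairs:
            es = [NULL] + list(es)
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p

        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})
    return t


# Toy corpus, invented for illustration only.
pairs = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
    (["a", "flower"], ["une", "fleur"]),
]
t = train_model1(pairs)

for f in ("la", "maison"):
    print(f, "given house:", round(t[(f, "house")], 3))
```

No alignment is ever given to the program. Because "la" also co-occurs with "the" in another sentence, each iteration shifts more of its probability mass toward "the", leaving "maison" as the better explanation for "house". The later models add reordering and fertility parameters on top of this basic scheme.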
The work grew out of IBM’s speech-recognition group and its conviction, associated with Frederick Jelinek, that statistics and data beat linguistic intuition. That conviction was controversial at the time but proved enormously influential, making this paper a direct ancestor of every data-driven translation system that followed.