“Statistical Phrase-Based Translation,” by Philipp Koehn, Franz Josef Och, and Daniel Marcu, was presented at the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2003), pages 127-133. It defined the framework behind the leading statistical machine translation systems, including Google Translate in its first decade, until neural networks displaced them.
The earlier IBM Models translated by reasoning about individual words. Phrase-based translation instead worked with contiguous chunks of words, “phrases” in the statistical rather than the grammatical sense. By learning that a multi-word sequence in one language tends to translate as a particular multi-word sequence in another, the system captured local context, idioms, and short-range reordering that word-by-word models missed. A phrase such as “with the chance of” maps cleanly onto its counterpart as a unit, whereas translating each word separately tends to produce stilted or wrong output.
The paper did more than propose the method; it ran controlled experiments to explain why phrase-based models beat word-based ones, examining the effect of phrase length and of how phrases were extracted from word alignments. The authors found that relatively simple, heuristically extracted phrases delivered most of the gain, and that phrases longer than about three words added little. This combination of a strong method and a clear empirical account of why it worked made the paper highly influential.
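As a rough illustration of what extracting phrases from a word alignment can look like, the Python sketch below enumerates every source and target span up to a maximum length and keeps a pair when no alignment link crosses the span boundary. This is the consistency criterion commonly described for phrase-based systems, not necessarily the exact heuristic the paper evaluated. The function name, the max_len parameter, and the German-English toy sentence with its hand-written alignment are all assumptions made for illustration.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Return phrase pairs consistent with a word alignment.

    src, tgt:   lists of tokens (hypothetical toy input).
    alignment:  set of (src_index, tgt_index) links.
    max_len:    maximum phrase length in words, applied to both sides.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            for t_start in range(len(tgt)):
                for t_end in range(t_start, min(t_start + max_len, len(tgt))):
                    links_inside = 0
                    consistent = True
                    for s, t in alignment:
                        s_in = s_start <= s <= s_end
                        t_in = t_start <= t <= t_end
                        if s_in and t_in:
                            links_inside += 1
                        elif s_in or t_in:
                            # A link leaves the candidate pair: reject it.
                            consistent = False
                            break
                    # Require at least one link so unaligned spans are skipped.
                    if consistent and links_inside > 0:
                        pairs.add((" ".join(src[s_start:s_end + 1]),
                                   " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Toy sentence pair; the alignment is invented for this example.
src = ["ich", "habe", "es", "gesehen"]
tgt = ["I", "saw", "it"]
alignment = {(0, 0), (1, 1), (2, 2), (3, 1)}  # "habe" and "gesehen" both link to "saw"

pairs = extract_phrase_pairs(src, tgt, alignment, max_len=3)
for pair in sorted(pairs):
    print(pair)
```

With the three-word limit, the sketch keeps "ich" / "I", "es" / "it", and "habe es gesehen" / "saw it". Neither "habe" nor "gesehen" can be paired with "saw" on its own, because the other word is linked to "saw" as well. Raising max_len to 4 adds only the whole-sentence pair, which is a small-scale picture of why longer phrases contribute comparatively little.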
Phrase-based translation became the production standard. Its ideas were packaged into the open-source Moses toolkit, which Koehn and others released in 2007, putting state-of-the-art statistical translation in the hands of any researcher or company. The approach dominated the field until encoder-decoder neural networks with attention overtook it around 2016.