Jorma Rissanen, then at IBM Research, published this paper in the journal Automatica in 1978, proposing a principle for choosing among competing statistical models. His starting observation is that any model, together with its parameters, can be used to encode an observed dataset, and the number of digits needed to write down that encoding depends on the model. A good model captures the regularities in the data and so allows a short description; a poor model does not.
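To make the link between model quality and description length concrete, here is a minimal sketch. It assumes the standard Shannon code length of -log2 p bits for an outcome of probability p. The function name `code_length_bits`, the Bernoulli models, and the data are illustrative choices, not taken from the paper.

```python
import math

def code_length_bits(data, p_one):
    """Ideal (Shannon) code length, in bits, of a 0/1 sequence
    under a Bernoulli model that assigns probability p_one to a 1."""
    return sum(-math.log2(p_one if x == 1 else 1.0 - p_one) for x in data)

# Made-up sequence for illustration: 13 ones and 3 zeros.
data = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]

fair = code_length_bits(data, 0.5)        # model that ignores the bias
biased = code_length_bits(data, 13 / 16)  # model matched to the observed frequency

print(f"fair coin model:   {fair:.2f} bits")
print(f"biased coin model: {biased:.2f} bits")
```

The model that captures the bias yields the shorter encoding. This count covers only the data given the model; the cost of stating the parameter itself is not included.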
From this Rissanen derived the minimum description length principle: select the model that minimizes the total number of digits needed to describe both the model and the data given the model. The criterion balances two competing pressures. A more complex model can fit the data more tightly, shortening the data part of the description, but it costs more digits to specify, lengthening the model part. The minimum sits where added complexity stops paying for itself.
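The trade-off can be sketched with polynomial regression. This example assumes Gaussian noise and a common asymptotic approximation that charges (k/2) log2 n bits for k real-valued parameters. It is an illustration under those assumptions, not necessarily the exact coding scheme of the paper. The function name `two_part_length_bits` and the synthetic data are hypothetical.

```python
import numpy as np

def two_part_length_bits(x, y, degree):
    """Approximate two-part description length, in bits, of y given x
    for a least-squares polynomial of the given degree.

    Assumptions (not from the source): Gaussian residuals, and a model
    cost of (k/2) * log2(n) bits for k coefficients. Additive constants
    shared by every degree are dropped, so only differences between
    degrees are meaningful and individual values may be negative.
    """
    n = len(x)
    k = degree + 1  # number of polynomial coefficients
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = float(np.sum(residuals ** 2))
    data_bits = 0.5 * n * np.log2(rss / n)  # cost of the data given the model
    model_bits = 0.5 * k * np.log2(n)       # cost of stating the model
    return model_bits + data_bits

# Synthetic data: a quadratic signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

lengths = {d: two_part_length_bits(x, y, d) for d in range(8)}
best_degree = min(lengths, key=lengths.get)

for d, bits in lengths.items():
    print(f"degree {d}: {bits:9.2f} bits")
print("selected degree:", best_degree)
```

Degrees below the true one pay heavily in the data term. Degrees above it shave only a little off the data term while each extra coefficient adds a fixed model cost, so the total is expected to bottom out near the true degree.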
The result is a principled answer to the perennial problem of overfitting, one that needs no separately tuned complexity penalty. It works by recasting statistical inference as data compression, which connects model selection to information theory and to the algorithmic complexity ideas of Solomonoff and Kolmogorov.
The principle remains influential because it formalizes Occam’s razor in a way you can compute. Its central intuition, that the model which best compresses the data is also the one most likely to generalize, still informs thinking about regularization and model selection across statistics and machine learning.