Mutual information measures how much two random variables tell you about each other. It is built from Shannon entropy, the measure of uncertainty introduced in Claude Shannon’s 1948 paper “A Mathematical Theory of Communication”, and it answers a precise question: by how many bits, on average, does learning the value of one variable reduce your uncertainty about the other? If the two variables are independent, the answer is zero; the more tightly they are linked, the larger the mutual information.
Formally, the mutual information between an input X and an output Y equals the entropy of the input minus its conditional entropy, the uncertainty that remains about the input once the output is known: I(X;Y) = H(X) - H(X|Y). In Shannon’s communication setting, the input is the message you send and the output is what arrives after passing through a noisy channel, so mutual information captures how much information about the original message survives the noise.
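To make the definition concrete, here is a minimal Python sketch that computes the input entropy minus the remaining entropy directly from a table of joint probabilities. The function names and the two example tables are illustrative assumptions, not taken from Shannon's paper; the first table models a fair bit sent through a channel that flips it 10% of the time, the second models two independent fair bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """Entropy of X minus conditional entropy of X given Y, in bits.

    joint[x][y] holds P(X=x, Y=y); the entries are assumed to sum to 1.
    """
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    # H(X|Y) is the average over y of the entropy of X given Y=y.
    h_x_given_y = 0.0
    for y, py in enumerate(p_y):
        if py > 0:
            h_x_given_y += py * entropy([row[y] / py for row in joint])
    return entropy(p_x) - h_x_given_y

# Hypothetical example tables (rows index the input, columns the output).
noisy = [[0.45, 0.05],
         [0.05, 0.45]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

print(mutual_information(noisy))
print(mutual_information(independent))
```

Under these assumptions the noisy channel yields about 0.531 bits, so roughly half of the one-bit input survives, while the independent table yields zero.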
This quantity is what defines a channel’s capacity. Shannon’s noisy channel coding theorem shows that the maximum rate at which information can be sent reliably over a channel equals the mutual information between its input and output, maximized over all possible input distributions. No coding scheme can communicate reliably at a higher rate, and rates below this limit are achievable with arbitrarily small error probability.
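The maximization over input distributions can be illustrated with a brute-force sketch for a channel with two input symbols. Everything named here (the helper functions, the grid search, the 10% flip probability) is an assumption chosen for illustration; the code uses the equivalent symmetric form of mutual information, the entropy of the output minus its entropy given the input, because it is convenient when the channel is given as conditional probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """Mutual information in bits, computed as H(Y) - H(Y|X).

    p_x[x] is P(X=x); channel[x][y] is P(Y=y | X=x).
    """
    n_in, n_out = len(channel), len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(n_in)) for y in range(n_out)]
    h_y_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(n_in))
    return entropy(p_y) - h_y_given_x

def capacity_binary_input(channel, steps=1000):
    """Grid search over q = P(X=0) for a two-input channel (illustrative only)."""
    best_rate, best_q = 0.0, 0.0
    for i in range(steps + 1):
        q = i / steps
        rate = mutual_information([q, 1 - q], channel)
        if rate > best_rate:
            best_rate, best_q = rate, q
    return best_rate, best_q

# Hypothetical binary symmetric channel that flips each bit with probability 0.1.
flip = 0.1
bsc = [[1 - flip, flip],
       [flip, 1 - flip]]

rate, q = capacity_binary_input(bsc)
print(rate, q)
```

For this symmetric channel the search settles on equally likely inputs and a rate of about 0.531 bits per use, matching the known closed form of one minus the binary entropy of the flip probability. A grid search only works for tiny input alphabets; for general discrete channels the Blahut-Arimoto algorithm is the usual choice.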
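Mutual information also matters in data analysis and machine learning, where it serves as a general-purpose measure of statistical dependence. Unlike the linear (Pearson) correlation coefficient, which can be zero even when one variable completely determines the other, mutual information is zero only when the variables are independent, so it registers nonlinear relationships as well as linear ones. This makes it useful for selecting informative features and for understanding how information flows through complex models.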
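A small worked case shows the contrast with correlation. In the sketch below, X is assumed to take the values -1, 0 and 1 with equal probability and Y is its square, so Y is fully determined by X yet the two are linearly uncorrelated. The distribution and helper names are hypothetical, and mutual information is computed in its equivalent sum form, the expected value of log2 of P(x, y) / (P(x) P(y)).

```python
import math

# Hypothetical joint distribution: X uniform on {-1, 0, 1}, Y = X squared.
joint = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}

def marginal(joint, axis):
    """Marginal distribution of the first (axis=0) or second (axis=1) variable."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def mutual_information(joint):
    """Mutual information in bits from a dict mapping (x, y) to P(x, y)."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

def pearson_correlation(joint):
    """Pearson correlation coefficient computed from the joint distribution."""
    mean_x = sum(p * x for (x, _), p in joint.items())
    mean_y = sum(p * y for (_, y), p in joint.items())
    cov = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in joint.items())
    var_x = sum(p * (x - mean_x) ** 2 for (x, _), p in joint.items())
    var_y = sum(p * (y - mean_y) ** 2 for (_, y), p in joint.items())
    return cov / math.sqrt(var_x * var_y)

print(pearson_correlation(joint))
print(mutual_information(joint))
```

The correlation comes out as zero, while the mutual information is about 0.918 bits, which is the full entropy of Y. This example works from known probabilities; estimating mutual information from finite samples is a separate problem that this sketch does not address.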