Proof of the Data Processing Inequality
Theorem
[!abstract] Data Processing Inequality
If $X \to Y \to Z$ forms a Markov chain (i.e., $X$ and $Z$ are conditionally independent given $Y$), then:
$$I(X; Y) \ge I(X; Z)$$
with equality if and only if $X \to Z \to Y$ also forms a Markov chain.
In words: Processing data can only destroy information, never create it.
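Before the proof, a quick numerical sanity check. The Python sketch below is illustrative, not part of the theorem: it assumes $X$ is a fair bit, $Y$ is $X$ passed through a binary symmetric channel with flip probability 0.1, and $Z$ is $Y$ passed through a second BSC with flip probability 0.2, then verifies $I(X;Y) \ge I(X;Z)$ directly from the joint tables.

```python
# Sanity check of the DPI on a toy Markov chain (illustrative sketch).
# Assumptions: X ~ Bernoulli(1/2); Y = X through a binary symmetric
# channel BSC(0.1); Z = Y through a second BSC(0.2).
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits from a 2-D joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), column vector
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), row vector
    mask = joint > 0                        # avoid log(0) on zero cells
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

def bsc(p_flip):
    """Transition matrix p(out | in) of a binary symmetric channel."""
    return np.array([[1 - p_flip, p_flip],
                     [p_flip, 1 - p_flip]])

px = np.array([0.5, 0.5])                   # uniform source X
W1, W2 = bsc(0.1), bsc(0.2)                 # channels X -> Y and Y -> Z

joint_xy = px[:, None] * W1                 # p(x, y) = p(x) p(y | x)
joint_xz = joint_xy @ W2                    # p(x, z) = sum_y p(x, y) p(z | y)

print(mutual_information(joint_xy))         # I(X;Y) = 1 - H(0.1) ~ 0.531 bits
print(mutual_information(joint_xz))         # I(X;Z) = 1 - H(0.26) ~ 0.173 bits <= I(X;Y)
```

Composing the two channels gives an effective flip probability of $0.1 \cdot 0.8 + 0.9 \cdot 0.2 = 0.26$, so the drop from about 0.531 to 0.173 bits is exactly the information destroyed by the second processing stage.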
Proof Strategy
We’ll use the chain rule for mutual information and properties of conditional mutual information.
Preliminary Lemmas
Lemma 1: Chain Rule for Mutual Information
Statement: $I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y)$; symmetrically, $I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z)$.
Proof: Expand via the chain rule for entropy:
$$I(X; Y, Z) = H(Y, Z) - H(Y, Z \mid X) = H(Y) + H(Z \mid Y) - H(Y \mid X) - H(Z \mid X, Y) = I(X; Y) + I(X; Z \mid Y).$$
The second form follows by exchanging the roles of $Y$ and $Z$.
Lemma 2: Markov Chain Condition
Statement: $X \to Y \to Z$ is Markov if and only if $I(X; Z \mid Y) = 0$.
Proof: Markovity means $p(z \mid x, y) = p(z \mid y)$. Thus $p(x, z \mid y) = p(x \mid y)\, p(z \mid y)$, so conditioning on $Y$ makes $X$ and $Z$ independent, hence $I(X; Z \mid Y) = 0$. Conversely, $I(X; Z \mid Y) = 0$ forces $p(x, z \mid y) = p(x \mid y)\, p(z \mid y)$ wherever $p(y) > 0$, which is exactly the Markov condition.
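A quick numerical check of Lemma 2 (a sketch; the specific distributions below are arbitrary illustrative choices): build $p(x, y, z) = p(x)\,p(y \mid x)\,p(z \mid y)$, which is Markov by construction, and confirm that $I(X; Z \mid Y)$ vanishes.

```python
# Sketch: verify Lemma 2 numerically on a Markov-by-construction joint.
import numpy as np

def conditional_mi(joint_xyz):
    """I(X;Z|Y) in bits from a 3-D joint table indexed [x, y, z]."""
    cmi = 0.0
    for y in range(joint_xyz.shape[1]):
        slab = joint_xyz[:, y, :]           # p(x, y0, z) for this fixed y0
        py = slab.sum()                     # p(y0)
        if py == 0:
            continue
        cond = slab / py                    # p(x, z | y0)
        px_y = cond.sum(axis=1, keepdims=True)
        pz_y = cond.sum(axis=0, keepdims=True)
        mask = cond > 0
        cmi += py * np.sum(cond[mask] * np.log2(cond[mask] / (px_y @ pz_y)[mask]))
    return float(cmi)

# Arbitrary illustrative distributions (not from the original text).
px = np.array([0.3, 0.7])
p_y_given_x = np.array([[0.9, 0.1], [0.4, 0.6]])
p_z_given_y = np.array([[0.8, 0.2], [0.25, 0.75]])

# p(x, y, z) = p(x) p(y|x) p(z|y): Markov by construction.
joint = px[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]
print(conditional_mi(joint))   # ~ 0.0 (zero up to floating-point error)
```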
Main Proof
Assume $X \to Y \to Z$ is a Markov chain.
Key Step
Apply the chain rule to $I(X; Y, Z)$ in two ways:
First way: $I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y)$
Second way: $I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z)$
Conclusion
Since $X \to Y \to Z$ is Markov, by Lemma 2: $I(X; Z \mid Y) = 0$.
From the second expansion: $I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z)$.
From the first expansion: $I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y) = I(X; Y)$.
Equating the two expansions gives $I(X; Y) = I(X; Z) + I(X; Y \mid Z)$. Since mutual information is non-negative, $I(X; Y \mid Z) \ge 0$, therefore:
$$I(X; Y) \ge I(X; Z)$$
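For reference, the whole argument compresses into a single chain of (in)equalities:

$$
\begin{aligned}
I(X;Y) &= I(X;Y) + \underbrace{I(X;Z \mid Y)}_{=\,0 \text{ by Lemma 2}} = I(X;Y,Z) \\
&= I(X;Z) + \underbrace{I(X;Y \mid Z)}_{\ge\, 0} \ \ge\ I(X;Z),
\end{aligned}
$$

with equality precisely when $I(X; Y \mid Z) = 0$, i.e., when $X \to Z \to Y$ is also a Markov chain, which is the equality condition stated in the theorem.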
Interpretation
This theorem has profound implications:
- No algorithm can extract more information about $X$ from $Z$ than was present in $Y$. If $Y$ is a lossy compression of $X$, and $Z$ is computed from $Y$, then $Z$ has even less information about $X$.
- For the dissertation: meme transmission is a Markov chain, $\text{source} \to \text{message} \to \text{receiver}$. The DPI says the receiver can never have more information about the source than was in the transmitted message.
- Communication bound: this is why channel capacity matters; it limits how much information can traverse the channel.
Corollaries
- Sufficient statistics: If $T(X)$ is a sufficient statistic for a parameter $\theta$ (with $\theta \to X \to T(X)$), then $I(\theta; T(X)) = I(\theta; X)$: no information is lost.
- Repeated processing: For any chain $X_1 \to X_2 \to \cdots \to X_n$ (illustrated by the sketch below):
$$I(X_1; X_n) \le I(X_1; X_{n-1}) \le \cdots \le I(X_1; X_2)$$
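A minimal sketch of the repeated-processing corollary (the per-stage binary symmetric channel with flip probability 0.1 is an illustrative assumption, not from the original): pass a uniform bit through a cascade of identical channels and watch $I(X_1; X_n)$ decay monotonically.

```python
# Repeated processing: I(X1; Xn) can only shrink along the chain (sketch).
# Assumption: each stage is a binary symmetric channel with flip prob 0.1.
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits from a 2-D joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

W = np.array([[0.9, 0.1],
              [0.1, 0.9]])                 # one processing stage
joint = np.diag([0.5, 0.5])                # p(x1, x1): uniform X1, no noise yet
for n in range(2, 7):
    joint = joint @ W                      # p(x1, xn) = sum p(x1, x_{n-1}) p(xn | x_{n-1})
    print(n, mutual_information(joint))
# n = 2..6: ~0.531, 0.320, 0.198, 0.125, 0.079 bits -- monotonically decreasing
```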
Generalizations
- The inequality extends to continuous random variables.
- There’s a strengthened version, the strong data processing inequality, which sharpens the bound by the channel’s contraction coefficient.
- Related: Fano’s inequality provides a lower bound on error probability.
Historical Notes
The data processing inequality was implicit in Shannon’s 1948 paper but was formalized later. It’s sometimes called the “no free lunch theorem of information theory.”
Sources
- Cover & Thomas, Elements of Information Theory, Theorem 2.8.1
- Shannon (1948), A Mathematical Theory of Communication (implicitly, via the channel coding theorem)