# Shannon Entropy
## Definition

Shannon entropy is a measure of the average uncertainty (or “surprise”) associated with a random variable. For a discrete random variable $X$ with possible outcomes $x_1, \ldots, x_n$ and probability mass function $p(x)$, the entropy is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

By convention, $0 \log 0 = 0$ (justified by continuity).
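A minimal computational sketch of the definition (Python; the function name `shannon_entropy` is mine, not from the sources):

```python
import math

def shannon_entropy(probs, base=2):
    """Entropy of a discrete distribution given as a list of probabilities.
    Terms with p = 0 are skipped, matching the 0 * log(0) = 0 convention."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))      # 1.0 bit    (fair coin)
print(shannon_entropy([0.99, 0.01]))    # ~0.081 bits (heavily biased coin)
print(shannon_entropy([1/6] * 6))       # ~2.585 bits (fair die)
```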
## Intuition

Entropy measures how surprised you expect to be when you learn the outcome of a random variable.
- If you flip a fair coin, each outcome is equally likely—maximum surprise, maximum entropy.
- If you flip a biased coin that lands heads 99% of the time, you’re rarely surprised—low entropy.
The key insight: entropy is the answer to “how many yes/no questions do I need, on average, to identify the outcome?”
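One way to make the “expected surprise” reading concrete is to compute the surprisal $-\log_2 p(x)$ of each outcome and weight it by how often that outcome occurs. A sketch (variable names are illustrative):

```python
import math

def surprisal_bits(p):
    """Surprise of observing an outcome that has probability p, in bits."""
    return -math.log2(p)

# Biased coin: heads 99% of the time.
p_heads, p_tails = 0.99, 0.01
print(surprisal_bits(p_heads))   # ~0.014 bits -- almost no surprise
print(surprisal_bits(p_tails))   # ~6.64 bits  -- rare outcome, big surprise

# Entropy is the expected surprisal: each surprise weighted by its probability.
entropy = p_heads * surprisal_bits(p_heads) + p_tails * surprisal_bits(p_tails)
print(entropy)                   # ~0.081 bits, far below the fair coin's 1 bit
```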
## Mathematical Formulation

### Bits as Units

When using $\log_2$, entropy is measured in bits. One bit is the entropy of a fair coin flip:

$$H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$
### Alternative Bases

- $\ln$ (natural log): entropy in nats
- $\log_{10}$: entropy in hartleys (rarely used)

Conversion: $H_b(X) = (\log_b a)\, H_a(X)$; for example, $1$ nat $= \log_2 e \approx 1.443$ bits.
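As a sketch, converting an entropy value between bases is just multiplication by a constant (the helper name is mine):

```python
import math

def convert_entropy(value, from_base, to_base):
    """Convert an entropy value between log bases: H_b(X) = (log_b a) * H_a(X)."""
    return value * math.log(from_base, to_base)

print(convert_entropy(1.0, 2, math.e))   # 1 bit ~ 0.693 nats
print(convert_entropy(1.0, math.e, 2))   # 1 nat ~ 1.443 bits
```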
## Key Properties

- Non-negativity: $H(X) \ge 0$, with equality iff $X$ is deterministic.
- Maximum entropy: for $n$ outcomes, $H(X) \le \log n$, with equality iff $X$ is uniform.
- Additivity for independent variables: $H(X, Y) = H(X) + H(Y)$ when $X$ and $Y$ are independent.
- Concavity: $H(p)$ is concave as a function of the distribution $p$.
- Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$ (see the numerical check below).
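To make the chain rule concrete, here is a small numerical check on a made-up 2×2 joint distribution (the probabilities are purely illustrative):

```python
import math

def H(probs):
    """Shannon entropy (in bits) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Dependent joint distribution p(x, y); rows index x, columns index y.
pxy = [[0.4, 0.1],
       [0.1, 0.4]]

px = [sum(row) for row in pxy]                 # marginal of X: [0.5, 0.5]
joint = [p for row in pxy for p in row]

# H(Y | X) = sum over x of p(x) * H(Y | X = x)
H_Y_given_X = sum(px[i] * H([p / px[i] for p in pxy[i]]) for i in range(len(pxy)))

print(H(joint))                 # ~1.722 bits = H(X, Y)
print(H(px) + H_Y_given_X)      # ~1.722 bits = H(X) + H(Y | X), matching the chain rule
```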
## Examples

### Example 1: Fair Die

A fair six-sided die has:

$$H = -\sum_{i=1}^{6} \tfrac{1}{6} \log_2 \tfrac{1}{6} = \log_2 6 \approx 2.585 \text{ bits}$$

You need about 2.6 yes/no questions on average to identify which face came up.
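The “2.6 questions” reading can be checked against an actual questioning strategy built by Huffman’s algorithm, which averages about 2.67 questions for a fair die; the entropy is the lower bound no strategy can beat. A sketch (not from the sources):

```python
import heapq

def huffman_lengths(probs):
    """Codeword length (number of yes/no questions) per outcome, via Huffman merging."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]   # (prob, tie-breaker, outcome group)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, g1 = heapq.heappop(heap)
        p2, _, g2 = heapq.heappop(heap)
        for i in g1 + g2:
            lengths[i] += 1        # everything in the merged pair costs one more question
        heapq.heappush(heap, (p1 + p2, tie, g1 + g2))
        tie += 1
    return lengths

probs = [1/6] * 6
lengths = huffman_lengths(probs)
print(sorted(lengths))                              # [2, 2, 3, 3, 3, 3]
print(sum(p * l for p, l in zip(probs, lengths)))   # ~2.667, just above H ~ 2.585
```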
### Example 2: English Letters

If all 26 letters were equally likely:

$$H = \log_2 26 \approx 4.7 \text{ bits per letter}$$
But English has non-uniform letter frequencies and strong dependencies between letters. Shannon estimated the true entropy of English text at roughly 1 bit per letter (his experiments gave a range of about 0.6–1.3 bits per character).

This gap (roughly 3.7 bits) is redundancy—it’s why compression works.
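A rough way to see the frequency effect is to estimate the first-order letter entropy of a text sample; this ignores dependencies between letters, so it still overestimates the true per-letter entropy. A sketch (the sample string is arbitrary):

```python
import math
from collections import Counter

def letter_entropy(text):
    """Empirical entropy (bits per letter) of the letter frequencies in `text`."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

sample = "information is the resolution of uncertainty and entropy measures it"
print(letter_entropy(sample))   # first-order estimate for this toy sample
print(math.log2(26))            # ~4.70 bits, the uniform upper bound
```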
## Connections

- Relates to: [[Boltzmann Entropy]], [[Kullback-Leibler Divergence]], [[Mutual Information]]
- Required for: [[Rate-Distortion Theory]], [[Channel Capacity]], [[Source Coding Theorem]]
- Generalizes: [[Differential Entropy]] (continuous case)
## Sources

- Shannon, C. (1948). “A Mathematical Theory of Communication”
- Cover & Thomas, Elements of Information Theory, Chapter 2
- MacKay, Information Theory, Inference, and Learning Algorithms, Chapter 2
## Open Questions

- How does the choice of logarithm base affect information-theoretic arguments in the meme framework?
- What’s the natural “base” for measuring memetic entropy—bits (binary), or something else?