Deriving the Maximum Entropy Distribution Under Mean Constraint
Given only the mean of a non-negative continuous random variable, find the probability density that maximizes entropy. This is the principle of maximum entropy (MaxEnt)—assume nothing beyond what you know.
Starting Point
We want to maximize the differential entropy:

$$h[p] = -\int_0^\infty p(x) \ln p(x)\, dx$$

Subject to the constraints:

- Normalization: $\int_0^\infty p(x)\, dx = 1$
- Mean constraint: $\int_0^\infty x\, p(x)\, dx = \mu$
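For concreteness, here is a minimal numerical sketch of the objective (my illustration, not part of the original note), assuming NumPy and SciPy are available; the helper name `differential_entropy` is made up for this example:

```python
import numpy as np
from scipy.integrate import quad

def differential_entropy(p):
    """Numerically evaluate h[p] = -integral of p(x) ln p(x) over [0, inf)."""
    integrand = lambda x: -p(x) * np.log(p(x)) if p(x) > 0 else 0.0
    value, _ = quad(integrand, 0, np.inf)
    return value

# Example: exponential density with mean mu = 2 (illustrative value)
mu = 2.0
print(differential_entropy(lambda x: np.exp(-x / mu) / mu))  # ~ 1 + ln 2 ~ 1.693
```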
Prerequisites
- [[Lagrange Multipliers]]
- [[Shannon Entropy]]
- [[Calculus of Variations]] (basic)
Derivation
Step 1: Set Up the Lagrangian
We form the functional:

$$J[p] = -\int_0^\infty p(x) \ln p(x)\, dx + \lambda_0 \left( \int_0^\infty p(x)\, dx - 1 \right) + \lambda_1 \left( \int_0^\infty x\, p(x)\, dx - \mu \right)$$

where $\lambda_0$ and $\lambda_1$ are Lagrange multipliers.
Step 2: Take the Functional Derivative
For the optimal $p(x)$, the first variation must vanish:

$$\frac{\delta J}{\delta p(x)} = 0$$

Computing term by term:

- Entropy term: $\dfrac{\delta}{\delta p(x)} \left( -\int_0^\infty p \ln p\, dx \right) = -\ln p(x) - 1$
- Normalization term: $\dfrac{\delta}{\delta p(x)} \left( \lambda_0 \int_0^\infty p\, dx \right) = \lambda_0$
- Mean term: $\dfrac{\delta}{\delta p(x)} \left( \lambda_1 \int_0^\infty x\, p\, dx \right) = \lambda_1 x$

Setting the sum to zero:

$$-\ln p(x) - 1 + \lambda_0 + \lambda_1 x = 0$$
Step 3: Solve for $p(x)$
Rearranging:

$$\ln p(x) = \lambda_0 - 1 + \lambda_1 x \quad \Longrightarrow \quad p(x) = e^{\lambda_0 - 1}\, e^{\lambda_1 x}$$

Let $C = e^{\lambda_0 - 1}$ and $\lambda = -\lambda_1$ (normalizability on $[0, \infty)$ requires $\lambda_1 < 0$), so:

$$p(x) = C e^{-\lambda x}$$
This is the exponential distribution!
Step 4: Determine the Constants
From normalization:

$$\int_0^\infty C e^{-\lambda x}\, dx = \frac{C}{\lambda} = 1 \quad \Longrightarrow \quad C = \lambda$$

From the mean constraint:

$$\int_0^\infty x\, \lambda e^{-\lambda x}\, dx = \frac{1}{\lambda} = \mu \quad \Longrightarrow \quad \lambda = \frac{1}{\mu}$$
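A quick numerical sanity check of these constants (a minimal sketch, assuming NumPy and SciPy; the value of $\mu$ is arbitrary):

```python
import numpy as np
from scipy.integrate import quad

mu = 3.0                                # illustrative mean
lam = 1.0 / mu                          # lambda = 1/mu from the mean constraint
p = lambda x: lam * np.exp(-lam * x)    # C = lambda from normalization

norm, _ = quad(p, 0, np.inf)                    # should be ~ 1
mean, _ = quad(lambda x: x * p(x), 0, np.inf)   # should be ~ mu
print(norm, mean)                               # ~ 1.0, 3.0
```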
Result
[!success] Final Result
The maximum entropy distribution for a non-negative random variable with known mean $\mu$ is the exponential distribution:

$$p(x) = \frac{1}{\mu}\, e^{-x/\mu}, \qquad x \ge 0$$
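To see the maximality claim numerically, one can compare the exponential against other non-negative distributions sharing the same mean (a sketch, assuming `scipy.stats`; the candidate set is an arbitrary choice for illustration):

```python
from scipy import stats

mu = 2.0  # common mean for every candidate

candidates = {
    "exponential":        stats.expon(scale=mu),              # mean = mu (the MaxEnt solution)
    "gamma(k=2)":         stats.gamma(a=2, scale=mu / 2),     # mean = k*theta = mu
    "uniform on [0,2mu]": stats.uniform(loc=0, scale=2 * mu), # mean = mu
}

for name, dist in candidates.items():
    print(f"{name:20s} mean = {dist.mean():.3f}   entropy = {dist.entropy():.4f}")

# The exponential attains the largest differential entropy, 1 + ln(mu) ~ 1.6931;
# the other candidates imply extra structure and pay for it in entropy.
```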
Interpretation
This result says: if all you know about a non-negative quantity is its average, you should model it as exponentially distributed.
Why? Because the exponential distribution makes the fewest assumptions beyond what you’ve measured. Any other distribution would imply additional structure you don’t actually know.
This is Jaynes’ key insight: entropy maximization is principled ignorance.
Special Cases
- When $\mu = 1$: Standard exponential, $p(x) = e^{-x}$
- In the limit $\mu \to 0$: The distribution concentrates at $x = 0$ (deterministic)
- In the limit $\mu \to \infty$: The distribution spreads out, approaching uniform (but improper)
Common Mistakes
[!warning] Watch Out
Don’t confuse this with maximizing entropy over all distributions on $[0, \infty)$—that problem is ill-posed (no maximum exists without constraints). The mean constraint is essential.
Verification
Dimensional check: $\mu$ has the units of $x$, so the exponent $x/\mu$ is dimensionless. ✓
Limiting case: As $\mu \to \infty$, entropy $h = 1 + \ln \mu \to \infty$. This makes sense—more spread means more uncertainty. ✓
Alternative derivation: This can also be done via the partition function approach from statistical mechanics, giving the same answer.
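As a sketch of that route (standard statistical mechanics, not spelled out in the original): take $p(x) = e^{-\lambda x} / Z(\lambda)$ and read the mean off the log-partition function:

$$Z(\lambda) = \int_0^\infty e^{-\lambda x}\, dx = \frac{1}{\lambda}, \qquad \langle x \rangle = -\frac{\partial \ln Z}{\partial \lambda} = \frac{1}{\lambda} = \mu \;\Longrightarrow\; \lambda = \frac{1}{\mu}$$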
Sources
- Jaynes, E.T. (1957). “Information Theory and Statistical Mechanics.” Physical Review, 106(4), 620–630.
- Cover, T.M. & Thomas, J.A. Elements of Information Theory, Chapter 12.