
Deriving the Maximum Entropy Distribution Under Mean Constraint

Given only the mean $\mu$ of a non-negative continuous random variable, find the probability density $p(x)$ that maximizes entropy. This is the principle of maximum entropy (MaxEnt): assume nothing beyond what you know.

We want to maximize the differential entropy:

$$H[p] = -\int_0^\infty p(x) \ln p(x) \, dx$$

Subject to constraints:

  1. Normalization: $\int_0^\infty p(x) \, dx = 1$
  2. Mean constraint: $\int_0^\infty x \, p(x) \, dx = \mu$
  • [[Lagrange Multipliers]]
  • [[Shannon Entropy]]
  • [[Calculus of Variations]] (basic)
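Before solving this analytically, here is a minimal numerical sketch of the same constrained problem: discretize a truncated support and maximize the discretized entropy subject to the normalization and mean constraints with `scipy.optimize.minimize`. The grid, cutoff, and `mu = 1.0` are illustrative choices of mine, not part of the derivation; truncation and discretization introduce small deviations from the exact answer derived below.

```python
# Numerical MaxEnt sketch: maximize H[p] on a truncated, discretized support
# subject to sum(p) * dx = 1 and sum(x * p) * dx = mu.
import numpy as np
from scipy.optimize import minimize

mu = 1.0                             # target mean (illustrative)
x = np.linspace(0.0, 10.0, 80)       # truncated grid approximating [0, inf)
dx = x[1] - x[0]

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)          # avoid log(0)
    return np.sum(p * np.log(p)) * dx    # this is -H[p]

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) * dx - 1.0},     # normalization
    {"type": "eq", "fun": lambda p: np.sum(x * p) * dx - mu},  # mean constraint
]

p0 = np.full_like(x, 1.0 / x[-1])    # uniform initial guess
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(0.0, None)] * len(x),
               constraints=constraints, options={"maxiter": 500})

p_hat = res.x
p_exp = np.exp(-x / mu) / mu         # the exponential density derived below
print("max |p_hat - exponential| =", float(np.abs(p_hat - p_exp).max()))
```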

We form the functional:

$$\mathcal{L}[p] = -\int_0^\infty p(x) \ln p(x) \, dx - \lambda_0 \left( \int_0^\infty p(x) \, dx - 1 \right) - \lambda_1 \left( \int_0^\infty x \, p(x) \, dx - \mu \right)$$

where $\lambda_0$ and $\lambda_1$ are Lagrange multipliers.

For the optimal $p^*(x)$, the first variation must vanish:

$$\frac{\delta \mathcal{L}}{\delta p(x)} = 0$$

Computing term by term:

$$\frac{\delta}{\delta p(x)} \left[ -p(x) \ln p(x) \right] = -\ln p(x) - 1$$

$$\frac{\delta}{\delta p(x)} \left[ -\lambda_0 \, p(x) \right] = -\lambda_0$$

$$\frac{\delta}{\delta p(x)} \left[ -\lambda_1 x \, p(x) \right] = -\lambda_1 x$$
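Because each integrand depends on $p(x)$ only pointwise (no derivatives of $p$ appear), each functional derivative reduces to the ordinary partial derivative of the integrand with respect to the value $p$. A quick symbolic check of the first, nontrivial term (the symbol name is mine):

```python
# Check d/dp [ -p ln p ] = -ln(p) - 1 symbolically.
import sympy as sp

p = sp.Symbol("p", positive=True)
print(sp.diff(-p * sp.log(p), p))   # prints: -log(p) - 1
```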

Setting the sum to zero:

$$-\ln p(x) - 1 - \lambda_0 - \lambda_1 x = 0$$

Rearranging:

$$\ln p(x) = -1 - \lambda_0 - \lambda_1 x$$

$$p(x) = e^{-1 - \lambda_0} \cdot e^{-\lambda_1 x}$$

Let $A = e^{-1 - \lambda_0}$, so:

$$p(x) = A \, e^{-\lambda_1 x}$$

This is the exponential distribution!
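As a sanity check on the algebra, the stationarity condition can be solved for $p$ symbolically (a sketch with sympy; the symbol names are mine):

```python
# Solve -ln(p) - 1 - lambda0 - lambda1*x = 0 for p.
import sympy as sp

p, x, lam0, lam1 = sp.symbols("p x lambda_0 lambda_1", positive=True)
sol = sp.solve(sp.Eq(-sp.log(p) - 1 - lam0 - lam1 * x, 0), p)
print(sol)   # [exp(-lambda_0 - lambda_1*x - 1)], i.e. A * exp(-lambda_1 * x)
```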

From normalization:

$$\int_0^\infty A \, e^{-\lambda_1 x} \, dx = A \cdot \frac{1}{\lambda_1} = 1 \implies A = \lambda_1$$

From the mean constraint:

$$\int_0^\infty x \cdot \lambda_1 e^{-\lambda_1 x} \, dx = \frac{1}{\lambda_1} = \mu \implies \lambda_1 = \frac{1}{\mu}$$
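Both integrals can be verified symbolically (a sketch; declaring `A` and `lambda_1` as positive symbols is my setup, matching the notation above):

```python
# Verify the normalization and mean integrals for p(x) = A * exp(-lambda_1 * x).
import sympy as sp

x, A, lam1 = sp.symbols("x A lambda_1", positive=True)

norm = sp.integrate(A * sp.exp(-lam1 * x), (x, 0, sp.oo))
mean = sp.integrate(x * lam1 * sp.exp(-lam1 * x), (x, 0, sp.oo))

print(norm)   # A/lambda_1  ->  setting this to 1 gives A = lambda_1
print(mean)   # 1/lambda_1  ->  setting this to mu gives lambda_1 = 1/mu
```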

> [!success] Final Result
> The maximum entropy distribution for a non-negative random variable with known mean $\mu$ is the exponential distribution:
>
> $$p^*(x) = \frac{1}{\mu} e^{-x/\mu}, \quad x \geq 0$$
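A quick numerical confirmation that this density is properly normalized and has mean $\mu$, using `scipy.stats.expon` with `scale = mu` (the value `mu = 2.5` is just an example):

```python
# Check normalization and mean of p*(x) = (1/mu) * exp(-x/mu).
import numpy as np
from scipy.stats import expon
from scipy.integrate import quad

mu = 2.5
dist = expon(scale=mu)               # exponential distribution with mean mu

total, _ = quad(dist.pdf, 0, np.inf)
print(total)         # ~1.0  (normalization)
print(dist.mean())   # 2.5   (mean constraint satisfied)
```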

This result says: if all you know about a non-negative quantity is its average, you should model it as exponentially distributed.

Why? Because the exponential distribution makes the fewest assumptions beyond what you’ve measured. Any other distribution would imply additional structure you don’t actually know.

This is Jaynes’ key insight: entropy maximization is principled ignorance.

  1. When $\mu = 1$: Standard exponential, $p(x) = e^{-x}$

  2. In the limit $\mu \to 0$: The distribution concentrates at $x = 0$ (deterministic)

  3. In the limit $\mu \to \infty$: The distribution spreads out, approaching uniform (but improper)

> [!warning] Watch Out
> Don't confuse this with maximizing entropy over all distributions on $[0, \infty)$: that problem is ill-posed (no maximum exists without constraints). The mean constraint is essential.

Dimensional check: $\lambda_1$ has units of $1/x$, so $\lambda_1 x$ is dimensionless. ✓

Limiting case: As $\mu \to \infty$, entropy $H = 1 + \ln \mu \to \infty$. This makes sense: more spread means more uncertainty. ✓
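The formula $H = 1 + \ln \mu$ can be checked against `scipy.stats.expon.entropy()` for a few values of $\mu$ (the chosen values are illustrative). It also illustrates limiting cases 2 and 3 above: $H \to -\infty$ as $\mu \to 0$ (concentration at zero) and $H \to \infty$ as $\mu \to \infty$.

```python
# Compare H = 1 + ln(mu) with scipy's differential entropy of Exponential(mu).
import numpy as np
from scipy.stats import expon

for mu in [0.01, 0.1, 1.0, 10.0, 100.0]:
    h_formula = 1 + np.log(mu)
    h_scipy = float(expon(scale=mu).entropy())
    print(f"mu={mu:7.2f}  formula={h_formula:+.4f}  scipy={h_scipy:+.4f}")
```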

Alternative derivation: This can also be done via the partition function approach from statistical mechanics, giving the same answer.
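For completeness, a small symbolic sketch of that route (my setup, not spelled out in the note): define the partition function $Z(\lambda) = \int_0^\infty e^{-\lambda x}\,dx$ and recover the mean as $-\partial_\lambda \ln Z$; setting it equal to $\mu$ gives $\lambda = 1/\mu$, the same multiplier as above.

```python
# Partition-function route: Z(lambda) = 1/lambda, mean = -d ln Z / d lambda.
import sympy as sp

x, lam = sp.symbols("x lambda", positive=True)

Z = sp.integrate(sp.exp(-lam * x), (x, 0, sp.oo))   # partition function
mean = sp.simplify(-sp.diff(sp.log(Z), lam))        # mean from ln Z

print(Z)      # 1/lambda
print(mean)   # 1/lambda  ->  setting this to mu gives lambda = 1/mu
```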

  • Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics"
  • Cover, T. M., & Thomas, J. A. *Elements of Information Theory*, Chapter 12