Jensen’s Inequality and Nonlinear Averaging

The average of a nonlinear function is not the function of the average. This single fact explains why plugging mean parameters into a nonlinear model gives biased predictions — a recurring trap in statistics and epidemiology.

The inequality

Let $g$ be a convex function and $X$ a random variable with finite mean. Then
$$\mathbb{E}[g(X)] \ge g\big(\mathbb{E}[X]\big).$$
For a concave function the inequality reverses: $\mathbb{E}[g(X)] \le g(\mathbb{E}[X])$. Equality holds if and only if $g$ is linear on the support of $X$, or $X$ is (almost surely) constant.

The intuition is geometric. Convexity means the chord lies above the curve, and more precisely every point of the curve has a supporting line beneath it. Taking expectations of that supporting line at $\mu = \mathbb{E}[X]$ gives the bound.
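The supporting-line argument can be written out in two lines. The slope $c$ is notation introduced here for illustration: it stands for the slope of a supporting line of $g$ at $\mu$, and equals $g'(\mu)$ when $g$ is differentiable there.

```latex
\begin{align*}
g(x) &\ge g(\mu) + c\,(x - \mu) \quad \text{for all } x, \\
\mathbb{E}[g(X)] &\ge g(\mu) + c\,\mathbb{E}[X - \mu] = g(\mu) = g\big(\mathbb{E}[X]\big).
\end{align*}
```

The second line follows from the first by replacing $x$ with $X$ and taking expectations; the slope term drops out because $\mathbb{E}[X - \mu] = 0$.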

Second-order intuition: the variance gap

A Taylor expansion of $g$ about $\mu = \mathbb{E}[X]$ (the delta method) makes the size of the gap explicit:
$$g(X) \approx g(\mu) + g'(\mu)(X-\mu) + \tfrac12 g''(\mu)(X-\mu)^2 .$$
Taking expectations, the linear term vanishes ($\mathbb{E}[X-\mu]=0$), leaving
$$\mathbb{E}[g(X)] - g(\mu) \approx \tfrac12\, g''(\mu)\,\operatorname{Var}(X).$$
When $g$ is convex, $g''(\mu) \ge 0$, so the gap is non-negative: Jensen again. The gap grows with the curvature $g''$ and with the spread $\operatorname{Var}(X)$.
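To see how good the second-order approximation is when $g$ is not quadratic, here is a minimal sketch. It assumes $g(x) = e^x$ and $X \sim \text{Normal}(\mu, \sigma^2)$, a case chosen for illustration (it is not from the text above) because the exact value $\mathbb{E}[e^X] = e^{\mu + \sigma^2/2}$ is known in closed form. The function names and the values of `mu` and `sigma` are hypothetical.

```python
import math

mu = 0.0  # hypothetical mean of X, chosen for illustration


def jensen_gap_exact(mu, sigma):
    """Exact E[exp(X)] - exp(mu) for X ~ Normal(mu, sigma^2), via the lognormal mean."""
    return math.exp(mu + 0.5 * sigma**2) - math.exp(mu)


def jensen_gap_second_order(mu, sigma):
    """Variance-gap approximation (1/2) g''(mu) Var(X) with g = exp, so g''(mu) = exp(mu)."""
    return 0.5 * math.exp(mu) * sigma**2


# Compare the exact gap with the second-order approximation as the spread grows.
for sigma in (0.1, 0.5, 1.0):
    exact = jensen_gap_exact(mu, sigma)
    approx = jensen_gap_second_order(mu, sigma)
    print(f"sigma={sigma}: exact gap={exact:.4f}, second-order={approx:.4f}")
```

In this setting the two agree closely for small $\sigma$, and the approximation falls short of the exact gap as $\sigma$ grows, because the neglected higher-order terms of $e^x$ are all positive.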

Examples

Worked example

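Let $X$ take the values $1$ and $3$, each with probability $\tfrac12$, and take the convex function $g(x) = x^2$. Then $\mathbb{E}[X] = \tfrac12(1 + 3) = 2$, so $g(\mathbb{E}[X]) = 4$, while $\mathbb{E}[g(X)] = \tfrac12(1 + 9) = 5$.

So $\mathbb{E}[g(X)] = 5 \ge 4 = g(\mathbb{E}[X])$, with a gap of $1$.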

Check the variance-gap formula: $\operatorname{Var}(X) = \mathbb{E}[X^2] - \mu^2 = 5 - 4 = 1$ and $g''(x) = 2$, so
$$\tfrac12 g''(\mu)\operatorname{Var}(X) = \tfrac12(2)(1) = 1,$$
which matches the gap exactly. The approximation is exact here because $g$ is quadratic (higher derivatives vanish).
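The arithmetic of the worked example can be checked with exact rational numbers. This is a small sketch using Python's standard `fractions` module; the variable names are illustrative.

```python
from fractions import Fraction

# X takes the values 1 and 3, each with probability 1/2.
values = [Fraction(1), Fraction(3)]
probs = [Fraction(1, 2), Fraction(1, 2)]


def g(x):
    return x**2


mean_x = sum(p * x for p, x in zip(probs, values))                  # E[X]
mean_gx = sum(p * g(x) for p, x in zip(probs, values))              # E[g(X)]
var_x = sum(p * (x - mean_x) ** 2 for p, x in zip(probs, values))   # Var(X)

gap = mean_gx - g(mean_x)                   # E[g(X)] - g(E[X])
second_order = Fraction(1, 2) * 2 * var_x   # (1/2) g''(mu) Var(X), with g'' = 2

print(mean_x, g(mean_x), mean_gx, gap, second_order)   # 2 4 5 1 1
```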

A biological example: performance in a fluctuating environment

Most biological rates — development, metabolism, photosynthesis, even pathogen transmission — depend on temperature through a curved thermal performance curve that peaks at an optimum and falls off on either side. Near that optimum the curve is concave, so Jensen’s inequality bites: an organism experiencing a fluctuating temperature performs worse, on average, than one held at the same mean temperature. Ecologists call plugging the mean temperature into a nonlinear rate the “fallacy of the averages” (Ruel & Ayres, 2001).

A concave thermal performance curve: because performance is concave near the optimum, the average of performance at 20 °C and 36 °C (0.37) sits far below the performance at their mean temperature of 28 °C (1.00).

Take a performance curve peaking at $T_\text{opt} = 28^\circ\text{C}$ and a habitat that spends half its time at $20^\circ$ and half at $36^\circ$, for a mean temperature of exactly $28^\circ$. Performance at the mean temperature is the maximum, $P(\bar T) = 1.00$, but the mean performance is only $\tfrac12[P(20) + P(36)] = 0.37$: variability alone costs 63% of performance, with no change in the average temperature. This is why climate variability, not just mean warming, reshapes development rates, vector activity, and transmission, and why a model fed the mean temperature over-predicts. The sign can flip: in the accelerating, convex low-temperature tail of the same curve, added variability would raise mean performance, exactly as $\tfrac12 P''(\mu)\operatorname{Var}(T)$ predicts.

import numpy as np
Topt, width = 28.0, 8.0
P = lambda T: np.exp(-((T - Topt) / width) ** 2)   # concave near the optimum

T = np.array([20.0, 36.0])          # two equally likely temperatures, mean 28
print("P(mean T)  =", round(float(P(T.mean())), 3))   # 1.0   performance at the mean
print("E[P(T)]    =", round(float(P(T).mean()), 3))   # 0.368 mean performance
print("Jensen gap =", round(float(P(T.mean()) - P(T).mean()), 3))  # 0.632

P(mean T)  = 1.0
E[P(T)]    = 0.368
Jensen gap = 0.632
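The sign flip in the convex low-temperature tail can be sketched with the same Gaussian curve. The cold-habitat temperatures of 10 °C and 14 °C are hypothetical values chosen for illustration, not taken from the text; they sit well below the optimum, where this curve is convex.

```python
import numpy as np

T_opt, width = 28.0, 8.0   # same curve parameters as the example above


def performance(T):
    """Gaussian thermal performance curve, as in the example above."""
    return np.exp(-((T - T_opt) / width) ** 2)


# Hypothetical cold habitat: half the time at 10 °C, half at 14 °C (mean 12 °C).
T_cold = np.array([10.0, 14.0])

at_mean = float(performance(T_cold.mean()))     # performance at the mean temperature
mean_perf = float(performance(T_cold).mean())   # mean performance across the two temperatures

print("P(mean T) =", round(at_mean, 4))
print("E[P(T)]   =", round(mean_perf, 4))
print("variability raises mean performance:", mean_perf > at_mean)
```

Here the mean performance exceeds the performance at the mean temperature, so a model fed the mean temperature would under-predict rather than over-predict.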

Simulation

R

set.seed(42)
g <- function(x) x^2
X <- runif(1e6, 0, 1)           # Uniform(0,1): mu = 1/2, Var = 1/12
lhs <- mean(g(X)); rhs <- g(mean(X))
c(E_gX = lhs, g_EX = rhs, gap = lhs - rhs)
# E_gX ~ 0.3333  g_EX ~ 0.25  gap ~ 0.0833  (>= 0, confirms Jensen)

gpp <- 2                        # g''(x) = 2
0.5 * gpp * var(X)              # ~ 0.0833: variance-gap approximation

Python

import numpy as np
np.random.seed(42)
g = lambda x: x**2
X = np.random.uniform(0, 1, 1_000_000)
lhs, rhs = g(X).mean(), g(X.mean())
print(lhs, rhs, lhs - rhs)      # ~0.3333 ~0.25 ~0.0833 (gap >= 0)
print(0.5 * 2 * X.var())        # ~0.0833: (1/2) g''(mu) Var(X)

0.3336193530309505 0.2503345980567165 0.08328475497423404
0.08328475497423413

Julia

using Random, Statistics
Random.seed!(42)
g(x) = x^2
X = rand(1_000_000)             # Uniform(0,1)
lhs, rhs = mean(g.(X)), g(mean(X))
println((lhs, rhs, lhs - rhs))  # ~(0.3333, 0.25, 0.0833)
println(0.5 * 2 * var(X))       # ~0.0833

For $\text{Uniform}(0,1)$ the exact gap is $\mathbb{E}[X^2] - (\mathbb{E}X)^2 = \tfrac13 - \tfrac14 = \tfrac{1}{12}\approx 0.0833$, matching both the simulation and the second-order formula.

Why it matters for statistics

Jensen’s inequality is the reason “average the inputs, then apply the model” disagrees with “apply the model, then average” whenever the model is nonlinear. It underlies the bias of plug-in estimators, the direction of the delta-method correction, the lower bounds on the log-likelihood that motivate the EM algorithm and variational inference (bounds of the form $\mathbb{E}[\log Z] \le \log \mathbb{E}[Z]$ for a positive random variable $Z$, since $\log$ is concave), and warnings against using mean parameters in nonlinear epidemic models. Knowing the sign (from convexity) and size (from the variance gap) of the discrepancy lets you correct for it.
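The disagreement between the two orders of operation can be sketched with a toy model. Everything here is an assumption made for illustration: the model is plain exponential growth $e^{rt}$ (convex in the rate $r$), not an epidemic model from the text, and the two growth rates and the time horizon are hypothetical numbers.

```python
import math

# Hypothetical numbers for illustration only: two equally likely growth rates
# (per day) and a fixed time horizon in days.
rates = [0.1, 0.3]
t = 10.0


def model(r):
    """Relative size after t days of exponential growth at rate r (convex in r)."""
    return math.exp(r * t)


mean_rate = sum(rates) / len(rates)
plug_in = model(mean_rate)                              # average the inputs, then apply the model
averaged = sum(model(r) for r in rates) / len(rates)    # apply the model, then average

print(f"model(mean rate) = {plug_in:.2f}")
print(f"mean of model    = {averaged:.2f}")
```

Because this model is convex in $r$, plugging in the mean rate understates the average outcome; for a concave model the direction would reverse.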