Optimization and Critical Points
Optimization is the search for a function’s highest or lowest value. It is the mathematical core of maximum likelihood estimation, least squares, and fitting disease models to data. Estimating a pathogen’s transmission rate by maximum likelihood and choosing the vaccination coverage that minimizes total cases are both optimization problems, and in each a smooth interior optimum sits where the derivative is zero.
Critical points
At a smooth interior maximum or minimum, the tangent line is flat, so the derivative is zero. Points where

$$f'(x) = 0$$

are called critical points (candidates for extrema). They may be maxima, minima, or saddle/inflection points, so they must be classified.
The second-derivative test
For a critical point $x^*$ with $f'(x^*) = 0$:
- if $f''(x^*) > 0$, the curve bends upward, so $x^*$ is a local minimum;
- if $f''(x^*) < 0$, the curve bends downward, so $x^*$ is a local maximum;
- if $f''(x^*) = 0$, the test is inconclusive.
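The test above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the helper name `classify_critical_point`, the tolerance `tol`, and the example functions $x^3 - 3x$ and $x^4$ are assumptions chosen for the demonstration, with their derivatives worked out by hand.

```python
def classify_critical_point(f2_value, tol=1e-12):
    """Apply the second-derivative test given f''(x*) at a critical point x*."""
    if f2_value > tol:
        return "local minimum"
    if f2_value < -tol:
        return "local maximum"
    return "inconclusive"

# Example 1 (assumed for illustration): f(x) = x^3 - 3x
# f'(x) = 3x^2 - 3 vanishes at x = -1 and x = 1; f''(x) = 6x.
def f2_cubic(x):
    return 6 * x

# Example 2 (assumed for illustration): f(x) = x^4
# f'(0) = 0 and f''(x) = 12x^2, so f''(0) = 0.
def f2_quartic(x):
    return 12 * x ** 2

results = {
    "x^3 - 3x at x = -1": classify_critical_point(f2_cubic(-1.0)),
    "x^3 - 3x at x = 1": classify_critical_point(f2_cubic(1.0)),
    "x^4 at x = 0": classify_critical_point(f2_quartic(0.0)),
}
for label, verdict in results.items():
    print(f"{label}: {verdict}")
```

The last case shows why "inconclusive" is a real outcome: $x^4$ does have a minimum at $0$, but the second derivative alone cannot reveal it.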
Convex vs. concave
A function with $f''(x) \ge 0$ everywhere is convex (bowl-shaped): any critical point is a global minimum. A function with $f''(x) \le 0$ everywhere is concave (dome-shaped): any critical point is a global maximum. This global guarantee is why convexity/concavity is so prized in optimization.
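The convexity guarantee above can be checked numerically on a small example. The sketch below assumes an illustrative convex function $f(x) = (x-1)^2 + \tfrac{1}{2}x^4$ (not from the text), finds its critical point by bisection on $f'$, and compares it against a grid of other points. The helper `bisect_root`, the interval $[-5, 5]$, and the grid size are all assumptions made for the demonstration.

```python
def f(x):
    return (x - 1) ** 2 + 0.5 * x ** 4

def f_prime(x):
    return 2 * (x - 1) + 2 * x ** 3

def f_second(x):
    return 2 + 6 * x ** 2  # strictly positive for every x, so f is convex

def bisect_root(g, lo, hi, tol=1e-12):
    """Find a root of g on [lo, hi], assuming g(lo) and g(hi) differ in sign."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if (g_mid > 0) == (g_lo > 0):
            lo, g_lo = mid, g_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)

# The critical point: the solution of f'(x) = 0
x_star = bisect_root(f_prime, -5.0, 5.0)

# No grid point on [-5, 5] should have a smaller value than f(x_star)
grid = [-5.0 + 10.0 * i / 10_000 for i in range(10_001)]
is_global_min = all(f(x) >= f(x_star) - 1e-12 for x in grid)
print(x_star, is_global_min)
```

A grid comparison is only a sanity check, not a proof; the actual guarantee comes from $f''(x) > 0$ holding everywhere.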
Worked example: a quadratic
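Maximize $f(x) = -(x-3)^2 + 5$. Differentiate and set to zero:

$$f'(x) = -2(x-3) = 0 \quad\Longrightarrow\quad x = 3.$$

Check the second derivative: $f''(x) = -2 < 0$, so $x = 3$ is a maximum. The maximum value is $f(3) = 5$.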
Monotonic transformations preserve the argmax
If $g$ is a strictly increasing function, then $x^*$ maximizes $f(x)$ if and only if it maximizes $g(f(x))$; the location of the maximum is unchanged (only the height changes). Since $\log$ is strictly increasing, maximizing a likelihood $L(\theta)$ and maximizing the log-likelihood $\ell(\theta) = \log L(\theta)$ give the same estimate. Because the log turns products into sums, $\ell(\theta)$ is far easier to differentiate; this is why MLE almost always works with the log-likelihood.
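The invariance claim above is easy to confirm with a brute-force grid search. The sketch below assumes an illustrative binomial-type likelihood $L(p) = p^{k}(1-p)^{n-k}$ with $k = 7$ and $n = 10$ (the constant binomial coefficient is dropped because it does not move the maximum); the grid spacing of $0.001$ is an arbitrary choice.

```python
import math

k, n = 7, 10  # illustrative data: 7 successes in 10 trials

def likelihood(p):
    return p ** k * (1 - p) ** (n - k)

def log_likelihood(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Grid over the open interval (0, 1); the endpoints are excluded so log() is finite
grid = [i / 1000 for i in range(1, 1000)]

p_best_lik = max(grid, key=likelihood)
p_best_loglik = max(grid, key=log_likelihood)

# Both searches select the same grid point, even though the heights differ
print(p_best_lik, p_best_loglik)
print(likelihood(p_best_lik), log_likelihood(p_best_loglik))
```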
Worked example: a 1-D likelihood
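Suppose we observe $k$ successes in $n$ independent trials with success probability $p$. Up to an additive constant that does not depend on $p$, the log-likelihood is

$$\ell(p) = k \log p + (n-k)\log(1-p).$$

Setting $\ell'(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0$ and solving gives the intuitive estimate $\hat{p} = k/n$. With $k = 7$, $n = 10$, $\hat{p} = 0.7$.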
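For readers who want the intermediate algebra, one way to fill in the steps is shown below. It assumes $0 < k < n$, so that the solution lies strictly inside $(0, 1)$, and writes $\ell(p) = k\log p + (n-k)\log(1-p)$ for the binomial log-likelihood with the constant term dropped.

$$
\begin{aligned}
\ell'(p) &= \frac{k}{p} - \frac{n-k}{1-p} = 0 \\
k\,(1-p) &= (n-k)\,p \\
k &= n\,p \quad\Longrightarrow\quad \hat{p} = \frac{k}{n}, \\
\ell''(p) &= -\frac{k}{p^{2}} - \frac{n-k}{(1-p)^{2}} < 0 .
\end{aligned}
$$

Because $\ell''(p) < 0$ for every $p$ in $(0, 1)$, the log-likelihood is concave there, so $\hat{p}$ is the global maximum and not merely a critical point.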
Computing it
### R
```r
# Maximize f(x) = -(x-3)^2 + 5 (optimize() minimizes by default, so negate)
f <- function(x) -(x - 3)^2 + 5
optimize(function(x) -f(x), interval = c(-10, 10))
# $minimum ~ 3, $objective ~ -5  =>  maximum of f is 5 at x = 3

# Find the critical point directly via root-finding on f'(x) = -2(x-3)
uniroot(function(x) -2*(x - 3), interval = c(-10, 10))$root  # ~ 3
```
### Python
```python
import numpy as np
from scipy.optimize import minimize_scalar

# Maximize f(x) = -(x-3)^2 + 5 (minimize_scalar minimizes, so negate)
f = lambda x: -(x - 3)**2 + 5
res = minimize_scalar(lambda x: -f(x))
print(res.x, -res.fun)  # approximately 3.0 and 5.0

# MLE example: maximize the binomial log-likelihood, k=7, n=10
k, n = 7, 10
nll = lambda p: -(k*np.log(p) + (n-k)*np.log(1-p))  # negative log-likelihood
# Bounds stay strictly inside (0, 1) so log(p) and log(1-p) remain finite
r = minimize_scalar(nll, bounds=(1e-6, 1-1e-6), method="bounded")
print(r.x)  # approximately 0.7 = k/n
```
```
3.0000000000000004 5.0
0.7000003717141288
```
### Julia
```julia
using Optim

f(x) = -(x - 3)^2 + 5
res = optimize(x -> -f(x), -10.0, 10.0)  # univariate optimize defaults to Brent's method
println(Optim.minimizer(res), " ", -Optim.minimum(res))  # approximately 3.0 and 5.0

# To solve f'(x) = 0 directly, a bracketing root finder such as Roots.jl also works:
# using Roots; find_zero(x -> -2*(x-3), (-10, 10))  # 3.0
```
Why it matters for statistics
Nearly every estimation method is an optimization: maximum likelihood maximizes the log-likelihood $\ell(\theta)$, least squares minimizes a sum of squared residuals, and MAP estimation maximizes a posterior. The first-order condition produces the estimating equations, and the second derivative (the Hessian in several dimensions) determines both which kind of extremum you found and, through the curvature at the optimum, the estimator’s approximate variance.
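The link between curvature and variance can be illustrated on the binomial example with $k = 7$, $n = 10$. The sketch below relies on the standard large-sample approximation $\operatorname{Var}(\hat{p}) \approx 1 / \bigl(-\ell''(\hat{p})\bigr)$, which is an assumption brought in here and not derived in this section; the finite-difference helper `second_derivative` and its step size `h` are likewise illustrative choices.

```python
import math

k, n = 7, 10
p_hat = k / n  # the maximum likelihood estimate

def log_likelihood(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

def second_derivative(g, x, h=1e-4):
    """Central finite-difference approximation to g''(x)."""
    return (g(x + h) - 2 * g(x) + g(x - h)) / h ** 2

curvature = second_derivative(log_likelihood, p_hat)  # negative at a maximum
observed_information = -curvature
approx_variance = 1 / observed_information
approx_se = math.sqrt(approx_variance)

# For the binomial this should agree closely with p_hat * (1 - p_hat) / n
print(curvature < 0, approx_variance, p_hat * (1 - p_hat) / n, approx_se)
```

A sharply curved log-likelihood (large $-\ell''$) pins the estimate down tightly, while a flat one leaves it poorly determined.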