The Gradient

The gradient collects all of a function’s partial derivatives into a single vector that points in the direction of steepest ascent. It is the workhorse of optimization: gradient descent steps against the gradient to minimize a loss, and gradient-based maximum likelihood estimation climbs it to maximize a log-likelihood.

Definition

For a scalar-valued function $f(x_1, \dots, x_n)$, the gradient is the vector of partial derivatives:

$$\nabla f = \left(\frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \dots,\; \frac{\partial f}{\partial x_n}\right).$$

Direction of steepest ascent

At any point, $\nabla f$ points in the direction in which $f$ increases fastest, and its magnitude $\lVert \nabla f \rVert$ is that maximum rate of increase. The negative gradient $-\nabla f$ points in the direction of steepest descent, which is exactly why we step along $-\nabla f$ to minimize a loss. At an interior maximum or minimum of a differentiable $f$, $\nabla f = \mathbf{0}$.
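The steepest-ascent claim can be checked numerically. The sketch below is a minimal illustration in plain Python, assuming the function $f(x, y) = x^2 + 3y^2$ and the point $(1, 1)$ purely as an example. The helper names `directional_rate` and `grad_f` are made up for this sketch. It scans unit directions around the circle, estimates the rate of change of $f$ along each one, and compares the best direction with the normalized gradient.

```python
import math

def f(x, y):
    # Illustrative function (an assumption for this sketch).
    return x**2 + 3 * y**2

def grad_f(x, y):
    # Analytic gradient of f: (2x, 6y).
    return (2 * x, 6 * y)

def directional_rate(x, y, ux, uy, h=1e-6):
    # Central-difference estimate of the rate of change of f
    # along the unit vector (ux, uy).
    return (f(x + h * ux, y + h * uy) - f(x - h * ux, y - h * uy)) / (2 * h)

px, py = 1.0, 1.0
gx, gy = grad_f(px, py)
grad_norm = math.hypot(gx, gy)

# Scan unit directions around the circle in 0.1-degree steps.
n = 3600
candidates = []
for k in range(n):
    theta = 2 * math.pi * k / n
    ux, uy = math.cos(theta), math.sin(theta)
    candidates.append((directional_rate(px, py, ux, uy), ux, uy))

# The tuple with the largest rate wins.
best_rate, best_ux, best_uy = max(candidates)

print("best direction :", (best_ux, best_uy))
print("unit gradient  :", (gx / grad_norm, gy / grad_norm))
print("best rate      :", best_rate)
print("gradient norm  :", grad_norm)
```

Up to the resolution of the scan, the best direction should line up with the unit gradient, and the best rate should sit close to the gradient's norm.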

Relation to the Jacobian

The gradient is the Jacobian of a scalar-valued function — specifically its single row (or, by convention, that row transposed into a column). More generally the Jacobian stacks the gradients of each component of a vector-valued function as its rows.
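To make the row structure concrete, here is a small finite-difference sketch in plain Python. The `jacobian` helper and the vector-valued example $F(x, y) = (xy,\; x + y^2)$ are hypothetical, introduced only for illustration. A real project would more likely use an automatic-differentiation or symbolic library.

```python
def jacobian(func, point, h=1e-6):
    """Central-difference Jacobian of func at point.

    func maps a list of n floats to a list of m floats; the result is an
    m x n nested list whose row i approximates the gradient of component i.
    """
    n = len(point)
    m = len(func(point))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        forward = list(point)
        backward = list(point)
        forward[j] += h
        backward[j] -= h
        f_plus, f_minus = func(forward), func(backward)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

# Scalar-valued case, wrapped as a one-component function:
# the Jacobian has a single row, which is the gradient.
def scalar_f(v):
    return [v[0]**2 + 3 * v[1]**2]

# Hypothetical vector-valued example F(x, y) = (x*y, x + y^2).
def vector_F(v):
    return [v[0] * v[1], v[0] + v[1]**2]

J_scalar = jacobian(scalar_f, [1.0, 1.0])
J_vector = jacobian(vector_F, [1.0, 1.0])

print(J_scalar)   # one row, approximately [2, 6]
print(J_vector)   # two rows, approximately [1, 1] and [1, 2]
```

Each row of `J_vector` is the gradient of one component of `vector_F`, which is the stacking described above.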

Worked example: $f(x,y) = x^2 + 3y^2$

The partials are $\dfrac{\partial f}{\partial x} = 2x$ and $\dfrac{\partial f}{\partial y} = 6y$, so

$$\nabla f(x, y) = (2x,\; 6y).$$

At $(1, 1)$, $\nabla f = (2, 6)$: $f$ climbs steepest heading in that direction, and $-(2, 6)$ is the steepest-descent direction, heading downhill toward (though not straight at) the minimum at the origin.
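As a quick check on the magnitude claim from earlier, the norm and unit direction of the gradient at this point work out as follows. The symbol $\hat{u}$ for the unit vector is notation introduced here, not part of the original example.

```latex
\begin{aligned}
\lVert \nabla f(1,1) \rVert &= \sqrt{2^2 + 6^2} = \sqrt{40} = 2\sqrt{10} \approx 6.32, \\
\hat{u} &= \frac{\nabla f(1,1)}{\lVert \nabla f(1,1) \rVert}
         = \left(\tfrac{1}{\sqrt{10}},\; \tfrac{3}{\sqrt{10}}\right) \approx (0.316,\; 0.949).
\end{aligned}
```

So moving from $(1, 1)$ along $\hat{u}$ raises $f$ at a rate of about 6.32 per unit distance, and no other direction does better.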

Computing it

R

library(numDeriv)
f <- function(v) v[1]^2 + 3 * v[2]^2
grad(f, c(1, 1))   # 2  6

Python

import sympy as sp
x, y = sp.symbols("x y")
f = x**2 + 3*y**2
[sp.diff(f, x), sp.diff(f, y)]      # [2*x, 6*y]

# Numeric gradient with numpy
import numpy as np
g = lambda v: v[0]**2 + 3*v[1]**2
def num_grad(g, v, h=1e-6):
    v = np.asarray(v, float)
    return np.array([(g(v + h*e) - g(v - h*e)) / (2*h)
                     for e in np.eye(len(v))])
num_grad(g, [1, 1])                 # [2., 6.]

Julia

using ForwardDiff
f(v) = v[1]^2 + 3v[2]^2
ForwardDiff.gradient(f, [1.0, 1.0])   # [2.0, 6.0]

Gradient descent in one snippet

Follow $-\nabla f$ downhill toward the minimum at the origin:

using ForwardDiff
f(v) = v[1]^2 + 3v[2]^2
x = [5.0, 5.0]
η = 0.1                              # learning rate
for _ in 1:100
    x .-= η .* ForwardDiff.gradient(f, x)   # in-place update; also works at script top level
end
x   # ≈ [0.0, 0.0]

Why it matters for statistics

Fitting a model almost always means optimizing an objective — minimizing a loss or maximizing a log-likelihood — and the gradient is the compass. Gradient ascent on the log-likelihood drives maximum likelihood estimation, while stochastic gradient descent trains the large models used in modern data science and disease forecasting.
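As a toy illustration of gradient ascent on a log-likelihood, the sketch below fits a Bernoulli success probability in plain Python. The data are invented for the example, and the logit parametrization $p = \sigma(\theta)$ is a choice made here so that every step keeps $p$ inside $(0, 1)$. The step size is picked by hand. For this model the maximum likelihood estimate is known in closed form (the sample mean), which gives something to check the iteration against.

```python
import math

# Hypothetical coin-flip data (1 = success); invented for this sketch.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(theta):
    # Bernoulli log-likelihood with success probability p = sigmoid(theta).
    p = sigmoid(theta)
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in data)

def score(theta):
    # Derivative of the log-likelihood with respect to theta: sum of (y - p).
    p = sigmoid(theta)
    return sum(y - p for y in data)

theta = 0.0      # starting guess, corresponding to p = 0.5
eta = 0.1        # step size, chosen by hand for this data set
for _ in range(200):
    theta += eta * score(theta)   # step uphill: gradient ascent

p_hat = sigmoid(theta)
print(p_hat, sum(data) / len(data))   # the estimate should match the sample mean
```

Stochastic gradient descent follows the same pattern, except that each step uses the gradient from a random subset of the data rather than the full sum.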