Chain Rule

The chain rule differentiates a composition of functions, that is, a function of a function. It is arguably the single most important differentiation rule: it powers backpropagation in neural networks and the delta method in statistics. First-order drug decay $C(t) = C_0 e^{-kt}$ and composed dose–response functions are differentiated with the chain rule, which is also what lets the gradient of a log-likelihood pass through nested link functions during model fitting.

The rule

If $y = f(g(x))$, then

$$\frac{d}{dx} f\big(g(x)\big) = f'\big(g(x)\big)\,g'(x).$$

In Leibniz notation, with $u = g(x)$,

$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}.$$
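
As a quick sanity check of the formula, the sketch below compares $f'(g(x))\,g'(x)$ with a central finite difference of the composition. The functions chosen ($f = \sin$, $g(x) = x^2 + 1$), the helper names `chain_rule_derivative` and `central_difference`, and the step size are illustrative assumptions, not part of the rule itself.

```python
import math

def chain_rule_derivative(f_prime, g, g_prime, x):
    """Right-hand side of the chain rule: f'(g(x)) * g'(x)."""
    return f_prime(g(x)) * g_prime(x)

def central_difference(func, x, h=1e-6):
    """Symmetric finite-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)

# Illustrative choice (an assumption): f(u) = sin(u), g(x) = x^2 + 1
f, f_prime = math.sin, math.cos
g = lambda x: x**2 + 1
g_prime = lambda x: 2 * x

x0 = 0.5
analytic = chain_rule_derivative(f_prime, g, g_prime, x0)  # cos(1.25) * 1.0
numeric = central_difference(lambda x: f(g(x)), x0)
print(analytic, numeric)  # the two values should agree to several decimals
```

Any differentiable pair $f$, $g$ could be substituted; the two numbers should agree up to finite-difference error.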

Intuition

Differentiate the outer function (leaving the inside alone), then multiply by the derivative of the inside. The rates multiply: if $u$ changes twice as fast as $x$ and $y$ changes three times as fast as $u$, then $y$ changes six times as fast as $x$.
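
The "rates multiply" picture can be sketched numerically. The linear functions below (slopes 2 and 3) and the helper name `slope` are hypothetical, chosen only to mirror the numbers in the paragraph above.

```python
def u_of_x(x):
    return 2.0 * x   # u changes twice as fast as x

def y_of_u(u):
    return 3.0 * u   # y changes three times as fast as u

def slope(func, x, h=1e-3):
    """Central finite-difference estimate of the rate of change of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 1.0
du_dx = slope(u_of_x, x0)                        # rate of u with respect to x
dy_du = slope(y_of_u, u_of_x(x0))                # rate of y with respect to u
dy_dx = slope(lambda x: y_of_u(u_of_x(x)), x0)   # rate of the composition

print(du_dx, dy_du, dy_dx)  # the last value should match du_dx * dy_du
```

For nonlinear functions the same product holds pointwise, with each rate evaluated at the corresponding point.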

Worked example 1: $e^{-\lambda x}$

The exponential survival/decay term $y = e^{-\lambda x}$ is $f(u) = e^{u}$ composed with $u = g(x) = -\lambda x$, where $g'(x) = -\lambda$:

$$\frac{d}{dx} e^{-\lambda x} = e^{-\lambda x}\cdot(-\lambda) = -\lambda\,e^{-\lambda x}.$$

Worked example 2: $(3x^2 + 1)^5$

Here $f(u) = u^5$ and $u = g(x) = 3x^2 + 1$, so $f'(u) = 5u^4$ and $g'(x) = 6x$:

$$\frac{d}{dx}\big(3x^2 + 1\big)^5 = 5\,(3x^2 + 1)^4 \cdot 6x = 30x\,(3x^2 + 1)^4 .$$

At $x = 1$: $30 \cdot 1 \cdot (4)^4 = 30 \cdot 256 = 7680$.

Computing it

R

# Symbolic
D(expression((3*x^2 + 1)^5), "x")
#   returns an unsimplified expression algebraically equal to 30*x*(3*x^2 + 1)^4

# Numeric check at x = 1
library(numDeriv)
grad(function(x) (3*x^2 + 1)^5, 1)   # 7680

Python

import sympy as sp
x, lam = sp.symbols("x lambda")
sp.diff(sp.exp(-lam * x), x)          # -lambda*exp(-lambda*x)
sp.diff((3*x**2 + 1)**5, x)           # 30*x*(3*x**2 + 1)**4

# Numeric check at x = 1
h = 1e-6
f = lambda x: (3*x**2 + 1)**5
(f(1 + h) - f(1 - h)) / (2 * h)       # ~7680.0

Julia

using Symbolics
@variables x λ
Symbolics.derivative(exp(-λ * x), x)        # -λ*exp(-λ*x)
Symbolics.derivative((3x^2 + 1)^5, x)        # 30x*(1 + 3(x^2))^4

using ForwardDiff
ForwardDiff.derivative(x -> (3x^2 + 1)^5, 1.0)   # 7680.0

Why it matters for statistics

The chain rule underlies the delta method, which approximates the variance of a transformed estimator $h(\hat\theta)$ using $\big[h'(\hat\theta)\big]^2 \operatorname{Var}(\hat\theta)$. It is also how gradients propagate through the layers of a model during backpropagation, making automatic differentiation and modern machine learning possible.
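
To make the delta-method formula concrete, here is a minimal simulation sketch. It assumes $h(\theta) = \log\theta$, takes $\hat\theta$ to be the sample mean of Exponential(1) draws, and uses an arbitrary sample size, replicate count, and seed; none of these choices come from the text above.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 50                  # sample size per replicate (assumed)
reps = 2000             # number of simulated replicates (assumed)
mu, sigma2 = 1.0, 1.0   # mean and variance of a single Exponential(rate=1) draw

# Delta method: Var(h(theta_hat)) ~ [h'(theta)]^2 * Var(theta_hat),
# with h(theta) = log(theta), h'(theta) = 1/theta, Var(sample mean) = sigma2 / n.
h_prime = 1.0 / mu
delta_var = h_prime**2 * (sigma2 / n)

# Monte Carlo estimate of the variance of log(sample mean)
log_means = []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    log_means.append(math.log(statistics.fmean(sample)))
empirical_var = statistics.variance(log_means)

print(delta_var, empirical_var)  # the two values should be close, not identical
```

The approximation is first-order, so the agreement is expected to improve as $n$ grows.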