Hilbert-Space Approximations for Gaussian Processes
A Gaussian process is a beautifully flexible prior over functions, but exact inference means factorizing an covariance matrix, which costs time and memory and stalls on more than a few thousand points. The Hilbert-space approximation (HSGP) sidesteps this by writing the process as a short, fixed sum of deterministic basis functions with independent Gaussian weights, so the model becomes an ordinary linear regression whose cost grows roughly linearly in . The trick, due to Solin and Särkkä (2020) and made practical for probabilistic programming by Riutort-Mayol et al. (2023), is that a stationary kernel is diagonalized by the eigenfunctions of the Laplacian on a bounded domain.
and sample paths drawn from the finite HSGP basis expansion." />
Why exact GPs are expensive
Give a GP a mean of zero and a covariance (kernel) function , and the values are jointly normal with covariance matrix . Conditioning on data or drawing samples requires the Cholesky factor of (or of ), an operation, and storing is . Doubling the data multiplies the work eightfold, which is why plain GPs are a luxury reserved for small datasets unless some structure is exploited. The HSGP replaces with a low-rank factorization built from fixed basis functions, and linear algebra on that costs — linear in for a fixed budget .
Kernels, eigenfunctions, and the Laplacian
The idea rests on two classical facts. First, a stationary kernel is determined by its spectral density , the Fourier transform of ; this is Bochner’s theorem, and it is the same statement as writing a covariance function as a mixture of cosines of every frequency. Second, on a bounded interval the negative Laplacian has a discrete set of eigenfunctions and eigenvalues that form an orthonormal basis, and those eigenfunctions are exactly sines and cosines. Solin and Särkkä tie the two together: because generates the same frequencies that the spectral density weights, a stationary kernel is approximately diagonalized by the Laplacian eigenbasis, with the spectral density evaluated at each eigenfrequency giving that basis function’s variance.
On the symmetric domain with Dirichlet (zero) boundary conditions, the eigenvalues and eigenfunctions are Each is a sine wave that completes half-oscillations across the interval, and is its angular frequency: larger means faster wiggles, exactly the high-frequency content that a short lengthscale demands.
The Hilbert-space approximation
The approximation writes the process as a finite weighted sum of these eigenfunctions, where is the spectral density of the chosen kernel and the are independent standard-normal weights. Because the are fixed and known, this is just a linear model in features: the design matrix with entries is precomputed once, the amplitudes are simple functions of the kernel hyperparameters, and only the (plus the hyperparameters , ) are inferred. Equivalently the covariance is approximated by a rank- factorization whose accuracy improves as grows and as the domain comfortably contains the data.
Spectral densities
The whole recipe needs only the spectral density of the kernel, evaluated at the eigenfrequencies . For the squared-exponential (RBF) kernel in one dimension, a Gaussian in frequency that decays quickly, so a modest number of low-frequency basis functions suffices. For the Matérn kernel with smoothness , a power-law tail that decays more slowly, so rougher Matérn processes (small ) need more basis functions than the RBF to reach the same fidelity. See covariance functions for how , , and shape these kernels in the first place.
Two knobs: and the boundary factor
The approximation has exactly two tuning parameters, and both trade accuracy against cost.
- Number of basis functions . More terms resolve higher frequencies, so controls the shortest lengthscale the approximation can represent; the cost is , so you want as small as fidelity allows.
- Boundary factor . The domain half-width is set to , where is a scale for the data (e.g. half the range of , after centering). The approximation degrades near the edges of , so the domain must be padded beyond the data with (typically to ); too small and the fit warps at the boundaries, too large and you waste basis functions on empty space.
These two interact: pushing outward with a larger stretches the eigenfunctions, lowering every eigenfrequency , so a larger domain needs a larger to keep resolving the same short lengthscale. The accuracy/lengthscale tradeoff is the crux: short lengthscales mean rapidly wiggling functions with high-frequency content, and capturing them requires many basis functions, while long, smooth lengthscales are represented well by just a handful.
A worked example: how many basis functions?
Suppose the data are centered and scaled so their half-range is , and you pick a boundary factor , giving . The finest feature a basis of sines can represent has a half-wavelength of about across the half-domain, so to resolve a lengthscale we need roughly a Nyquist-style floor showing that the basis budget scales inversely with the lengthscale. For this gives basis functions as a bare minimum; because the RBF spectral density still carries weight a little above the Nyquist frequency, a safety margin of roughly keeps the approximation faithful. Halve the lengthscale to and the requirement doubles to , so very wiggly functions quickly erode the speedup — HSGP shines when the process is smooth relative to its domain. Riutort-Mayol et al. turn exactly this reasoning into practical diagnostics and recommendation tables for choosing and together from the inferred lengthscale.
In code
R
library(brms)
# brms fits a Hilbert-space approximate GP through the gp() term:
# k = number of basis functions (m), c = boundary factor.
# The HSGP eigenfunctions and spectral density are constructed internally,
# so the model stays a fast linear-in-basis regression fitted with Stan.
fit <- brm(
y ~ gp(x, k = 20, c = 1.5),
data = df, chains = 4, cores = 4
)
# The related s(...) smooth terms (brms/mgcv splines) are another low-rank
# basis idea, differing mainly in how the basis and its penalty are chosen.
conditional_effects(fit, effects = "x")
Python
import numpy as np
sigma, ell, L, m = 1.0, 0.6, 3.0, 30 # kernel + approximation settings
x = np.linspace(-L, L, 200)
j = np.arange(1, m + 1)
sqrt_lam = j * np.pi / (2 * L) # sqrt of Laplacian eigenvalues
phi = np.sqrt(1.0 / L) * np.sin(sqrt_lam[:, None] * (x + L)) # (m, n) basis
S = sigma**2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (sqrt_lam * ell)**2)
K_hsgp = (phi * S[:, None]).T @ phi # rank-m approximate covariance
K_exact = sigma**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
print("basis functions m =", m)
print("max abs covariance err =", round(np.abs(K_hsgp - K_exact).max(), 4))
basis functions m = 30
max abs covariance err = 1.0
Julia
# Illustrative: build the sine eigenbasis and RBF spectral density by hand,
# then draw a sample path f(x) = sum_j sqrt(S(sqrt(lam_j))) phi_j(x) beta_j.
L, ell, sigma, m = 3.0, 0.6, 1.0, 30
sqrt_lam(j) = j * pi / (2L)
phi(j, x) = sqrt(1 / L) * sin(sqrt_lam(j) * (x + L))
S(w) = sigma^2 * sqrt(2pi) * ell * exp(-0.5 * (w * ell)^2)
x = range(-L, L; length = 200)
beta = randn(m) # beta_j ~ N(0, 1)
f = [sum(sqrt(S(sqrt_lam(j))) * phi(j, xi) * beta[j] for j in 1:m) for xi in x]
# In practice the Stan/Turing model infers beta together with sigma and ell.
Why it matters
Making GPs cheap turns them from a boutique method into a routine building block for Bayesian inference at scale.
A smooth spatial random field over thousands of locations, a temporal trend sampled at fine resolution, or a nonlinear covariate effect can all be given a GP prior and fitted with MCMC in minutes rather than hours, because each likelihood evaluation touches an -column design matrix instead of an kernel.
This is precisely why HSGP underlies the gp() terms in brms and the reference implementations in Stan, and why penalized-spline s(...) smooths are the workhorse for wiggly effects in the same models.
In infectious-disease work the payoff is direct: smooth space–time surfaces for prevalence mapping, flexible time-varying reproduction numbers, and delay or reporting effects can all be modeled as approximate GPs that actually finish sampling on real surveillance datasets.