Gaussian Processes
Ordinary regression fits a function with a fixed number of parameters; a Gaussian process (GP) instead puts a probability distribution directly over functions themselves. Ask it for the value at any input and it returns not a single number but a whole Gaussian: a best guess plus a variance that widens where data are sparse and pinches shut near observations. That makes a GP a flexible interpolator that always carries its own error bars, which is why it underpins simulator emulation, spatial kriging, and probabilistic time-series models.
The definition
A Gaussian process is a collection of random variables, any finite subset of which is jointly multivariate normal. We write where the mean function sets the average value at each input and the covariance function (or kernel) says how strongly the function values at two inputs and co-vary. Together and specify the process completely: every property of the infinite-dimensional distribution over functions is encoded in these two functions of the inputs.
The definition becomes concrete the moment you pick a finite set of inputs. For inputs the vector of function values has an ordinary multivariate normal distribution where is the vector of means and is the covariance matrix (or Gram matrix) with entries ; see matrix notation for the conventions. A GP is thus a multivariate normal “of infinite length”, and any question you ask of it collapses to linear algebra on the relevant finite block. It is common, and harmless after centering, to take the mean function , so that all the structure lives in the kernel.
Sampling from the prior
To draw a function from the prior, choose a dense grid of inputs , build the covariance matrix , and sample the vector . The standard recipe factors by a Cholesky decomposition and sets for a vector of independent standard normals, which reproduces the target covariance because . Connecting the sampled values across the grid traces one sample function, and repeating with fresh gives the family of prior draws on the left of the figure. The kernel controls what these draws look like: with a smooth kernel neighbouring values are highly correlated, so the samples are smooth wiggly curves rather than jagged noise.
GP regression: conditioning on data
The point of a GP is prediction, and prediction is just conditioning a joint Gaussian on the part you have observed. Suppose we observe noisy data with independent Gaussian noise , collected in the vector at training inputs , and we want the function at a set of test inputs . Because the training outputs and the test values are jointly Gaussian, they stack into where is training-to-training, is training-to-test, and is test-to-test.
Applying the standard Gaussian conditioning formulas gives the posterior (or predictive) distribution with These two lines are the whole of GP regression. The posterior mean is a weighted combination of the observed outputs — write and the prediction at a test point is , a smooth blend that leans on nearby observations — so the matrix inverse does the interpolating. The posterior covariance starts from the prior covariance and subtracts what the data explain, so uncertainty shrinks near observations and returns to the prior far from them. Setting the observation noise to zero forces the posterior mean through every data point exactly; a positive lets the mean pass smoothly near noisy points instead of chasing each one. This is Bayesian inference with the GP as a prior over functions, and it generalizes linear regression: a GP with a linear kernel reproduces Bayesian linear regression exactly, while richer kernels buy nonlinearity for free.
The kernel and its hyperparameters
The kernel is where modelling choices live, because it dictates how function values borrow strength from one another. A ubiquitous default is the radial basis function (RBF, or squared-exponential) kernel governed by two hyperparameters. The lengthscale sets the distance over which the function stays correlated: small yields wiggly functions that turn over quickly, large yields gently varying ones. The signal variance sets the marginal amplitude, the typical vertical spread of the function around its mean. Alongside the kernel sits the noise variance , which separates genuine signal from observation error.
Different kernels encode different beliefs — smoothness, periodicity, roughness — and they can be added or multiplied to combine structure. This page keeps the kernel light on purpose; for the Matérn family, periodic and rational-quadratic kernels, and the rules for composing them, see the dedicated covariance functions page.
Marginal likelihood and fitting hyperparameters
The hyperparameters are not guessed but learned from data by maximizing the marginal likelihood — the probability of the observed outputs with the latent function integrated out. Because everything is Gaussian, this integral is available in closed form: where is the noisy covariance matrix and is its determinant. The three terms trade off against each other and give hyperparameter fitting its self-regularizing character: the term rewards a model that fits the data, the term penalizes overly flexible (complex) models, and the constant just normalizes. Maximizing this objective over — typically by gradient ascent, since the gradients are also closed-form — automatically balances fit against complexity, so a GP tends to choose a sensible lengthscale rather than overfitting.
Computational cost
The price of this flexibility is that every prediction and every likelihood evaluation needs , or more stably its Cholesky factor, for an matrix built from training points. That factorization costs time and memory, which is comfortable for thousands of points but prohibitive for millions. Scaling GPs to large is an active field; a common route is a basis-function approximation such as the Hilbert-space GP, which reduces the cost to roughly linear in .
A worked example
Take two training points and with noisy observations and , an RBF kernel with and , and noise variance . We want the posterior at the test point . The kernel entries use , so and . The noisy training covariance is with determinant . Solving gives the weights . The cross-covariance to the test point is , so the posterior mean is The posterior variance is , a standard deviation of about . So the GP predicts : sensibly between the two observations, with moderate uncertainty because sits in the gap where neither point pins it down tightly.
In code
R
# The GauPro and DiceKriging packages fit GPs directly; here GauPro on 1D data.
library(GauPro)
set.seed(1)
x <- matrix(c(0, 1), ncol = 1)
y <- c(1, 2)
gp <- GauPro(x, y, kernel = "matern52", nug = 0.1) # nugget = noise variance
pred <- gp$predict(matrix(0.5, ncol = 1), se.fit = TRUE)
$pred$mean # posterior mean at x* = 0.5
$pred$se # posterior standard deviation
$```
### Python
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
rng = np.random.default_rng(0)
def rbf(a, b, ell=1.0, sf2=1.0): # RBF / squared-exponential kernel
d2 = (a[:, None] - b[None, :]) ** 2
return sf2 * np.exp(-0.5 * d2 / ell ** 2)
X = np.array([0.0, 1.0]) # training inputs
y = np.array([1.0, 2.0]) # noisy observations
sn2 = 0.1 # noise variance
Xs = np.array([-0.5, 0.5, 1.5]) # test inputs
L = cho_factor(rbf(X, X) + sn2 * np.eye(len(X))) # Cholesky of K + sn2 I
alpha = cho_solve(L, y) # weights (K + sn2 I)^-1 y
Ks = rbf(Xs, X)
mean = Ks @ alpha
cov = rbf(Xs, Xs) - Ks @ cho_solve(L, Ks.T) # posterior covariance
sd = np.sqrt(np.diag(cov))
for xi, mi, si in zip(Xs, mean, sd):
print(f"x*={xi:+.1f} mean={mi:.3f} sd={si:.3f}")
print("posterior draw:", np.round(rng.multivariate_normal(mean, cov), 3))
x*=-0.5 mean=0.496 sd=0.508
x*=+0.5 mean=1.551 sd=0.295
x*=+1.5 mean=1.626 sd=0.508
posterior draw: [0.511 1.744 1.551]
Julia
# GaussianProcesses.jl provides kernels, exact inference, and hyperparameter fits.
using GaussianProcesses, Random
Random.seed!(1)
x = [0.0, 1.0] # training inputs
y = [1.0, 2.0] # observations
kern = SE(0.0, 0.0) # RBF kernel: log lengthscale, log signal sd
gp = GP(x, y, MeanZero(), kern, log(sqrt(0.1))) # last arg: log noise sd
mu, sig2 = predict_y(gp, [0.5]) # posterior mean and variance at x* = 0.5
optimize!(gp) # maximize the marginal likelihood
Why it matters
A Gaussian process is the natural tool whenever you need predictions and honest uncertainty from limited, expensive data. In computer experiments it serves as an emulator (or surrogate): fit a GP to a handful of runs of a slow mechanistic simulator — an agent-based epidemic model, a climate model — and the cheap posterior mean stands in for the simulator, while the variance flags where another expensive run would most reduce ignorance, the engine behind Bayesian optimization and calibration. In spatial statistics the same mathematics is kriging: interpolate a disease rate, a pollutant, or a soil property across a landscape from scattered measurements, with the posterior variance mapping where the interpolation is trustworthy. In time series, GPs model smooth trends and periodic seasonality with kernels that add and multiply, giving forecasts with growing error bars into the future. Across all three the appeal is the same — a flexible, nonparametric fit that never forgets to tell you how much it does not know.