Solution: Gaussian processes#
import torch
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from cycler import cycler
import seaborn as sns
# Set the color scheme
sns.set_theme()
colors = [
"#0076C2",
"#EC6842",
"#A50034",
"#009B77",
"#FFB81C",
"#E03C31",
"#6CC24A",
"#EF60A3",
"#0C2340",
"#00B8C8",
"#6F1D77",
]
plt.rcParams["axes.prop_cycle"] = cycler(color=colors)
seed = 1
torch.manual_seed(seed)
np.random.seed(seed)
# By default torch uses float32, this can lead to numerical problems for the (semi-)positive definiteness of tensors.
torch.set_default_dtype(torch.float64)
Introduction#
The goal of this exercise is to develop a deeper understanding of how kernel methods work. We will start by implementing two covariance functions, both of which we will use throughout the rest of the notebook. Then we will create the general Gaussian Process framework and test it on a simple dataset. Finally, to get a good model, we perform model selection.
Let’s start by creating our dataset as usual:
# The true function relating t to x
def f_truth(x):
# Return a sine
return torch.sin(2 * x)
## Use the following slightly more complex function when comparing between the predictions made with different kernels
# return 2*torch.sin(2*x) + torch.exp(x/5)
# The data is generated from the ground truth with i.i.d. gaussian noise
def f_data(x, rng, epsilon=0.2):
# Generate N noisy observations (1 at each location)
t = f_truth(x) + torch.normal(mean=0, std=epsilon, size=x.shape, generator=rng)
return t
Kernel Engineering#
As we have seen in the lecture, many different covariance functions can be constructed from a set of elementary kernels, such as the polynomial or the Matérn kernels. The choice of this function has a large influence on the model’s prediction. In engineering applications, you often have some idea of what trends are present in the data, which can inform your choice for the covariance function.
Exercise: Implementing covariance functions#
Let’s start with the implementation of arguably the most popular choice of kernel, the so-called squared exponential kernel that you already know from the lecture:
It has two hyperparameters with the scaling factor \(\sigma_f\) and the length scale \(\ell\) that controls the correlation between data points. We pass these parameters as **hyperparams
to make it simple to switch between covariance functions with different parameters.
# Implement the covariance function
# (Note: you can use torch.cdist to compute ||x-x'|| )
def cov_gaussian(X1, X2, **hyperparams):
"""Gaussian (Bishop 6.23)"""
sig_f = hyperparams["sig_f"]
length = hyperparams["length"]
# ---------------------- student exercise --------------------------------- #
out = sig_f**2 * torch.exp(-torch.cdist(X1, X2) ** 2 / (2 * length**2))
# ---------------------- student exercise --------------------------------- #
return out
We set initial values for the parameters and plot the result:
hyperparams = {
"sig_f": 1,
"length": 1,
}
def plot_covariance(func):
fig, ax1 = plt.subplots(1, 1, figsize=(4, 3))
xlim = (-1, 1)
X = np.expand_dims(np.linspace(*xlim, 200), 1)
X = torch.tensor(X)
cov = func(X, X, **hyperparams)
# Plot covariance matrix
im = ax1.imshow(cov)
cbar = plt.colorbar(im, ax=ax1, fraction=0.045, pad=0.05)
cbar.ax.set_ylabel("$k(x,x)$", fontsize=10)
ax1.set_title(("Example of covariance matrix"))
ax1.set_xlabel("x", fontsize=13)
ax1.set_ylabel("x", fontsize=13)
ax1.grid(False)
plot_covariance(cov_gaussian)
While the squared exponential kernel is a popular choice for many applications, many other expressive kernels exist. An example that was not covered in the lecture is a exponentiated quadratic kernel with the addition of a constant and a linear term:
Implement this kernel or a custom kernel of your choice below.
def cov_quad_exp_add(X1, X2, **hyperparams):
"""Exponential of quadratic form with addition of constant and linear terms (Bishop 6.63)"""
theta0 = hyperparams["theta0"]
theta1 = hyperparams["theta1"]
theta2 = hyperparams["theta2"]
theta3 = hyperparams["theta3"]
# ---------------------- student exercise --------------------------------- #
out = (
theta0 * torch.exp(-0.5 * theta1 * torch.cdist(X1, X2) ** 2)
+ theta2
+ theta3 * torch.matmul(X1, X2.T)
)
# ---------------------- student exercise --------------------------------- #
return out
# Some initialization parameters, change these to observe what happens!
hyperparams["theta0"] = 2.25
hyperparams["theta1"] = 4
hyperparams["theta2"] = 2
hyperparams["theta3"] = 1
plot_covariance(cov_quad_exp_add)
We can use the squared exponential covariance for now. You can return to this and use a different one later!
cov_func = cov_gaussian
# cov_func = cov_quad_exp_add
Sample realizations#
# Sample from the Gaussian process distribution
n_points = 75 # Number of data points in each function
n_samples = 6 # Number of GP samples
# Independent variable samples
X = torch.linspace(-10, 10, n_points).view(-1, 1)
cov = cov_func(X, X, **hyperparams)
# Draw samples from a multivariate normal function with our covariance function
samples = torch.tensor(
np.random.multivariate_normal(mean=np.zeros(n_points), cov=cov, size=n_samples)
)
# Plot the sampled functions
fig, ax = plt.subplots(1, 1, figsize=(8, 3))
for i in range(n_samples):
ax.plot(X, samples[i], linestyle="-", marker="o", markersize=3)
ax.set_xlabel("$x$", fontsize=13)
ax.set_ylabel("$y = f(x)$", fontsize=13)
ax.set_title(f"{n_samples} different GP realizations at {n_points} points\n")
ax.set_xlim([-10, 10])
plt.show()
Exercise: Implementing the Gaussian Process#
Now we are ready to implement the Gaussian Process (GP)
In the lecture, we have already seen how to compute the posterior predictive distribution given the data in a single prediction location \(p(\hat{t}\vert\mathbf{t})\). This can be generalized to a vector of multiple prediction locations \(p(\hat{\mathbf{t}}\vert\mathbf{t})\). In that case, the posterior predictive distribution is given by:
with posterior mean
and posterior variance
Complete the function below to compute the posterior mean and covariance given the observed data \((\mathbf{X}, \mathbf{t})\) and a set of prediction locations \(\hat{\mathbf{X}}\).
# Gaussian process posterior
def GP(X, t, X_hat, kernel, hyperparams):
"""
:param X: Observation locations
:param t: Observation values
:param X_hat: Prediction locations
:param kernel: covariance function
:param hyperparams: The hyperparameters
:return: posterior mean and covariance matrix
"""
with torch.no_grad():
noise = hyperparams["noise"] # Note: noise**2 == beta^(-1)
# ---------------------- student exercise --------------------------------- #
# Kernel of the observations
k11 = kernel(X, X, **hyperparams)
# Kernel of observations vs to-predict
k12 = kernel(X, X_hat, **hyperparams)
C_inv = torch.inverse(k11 + noise**2 * torch.eye(k11.shape[0]))
kT_Cinv = torch.matmul(k12.T, C_inv)
# Compute posterior mean
mu = torch.matmul(kT_Cinv, t) # Bishop 6.66
# Compute the posterior covariance
C = kernel(X_hat, X_hat, **hyperparams) + noise**2 * torch.eye(X_hat.shape[0])
cov = C - torch.matmul(kT_Cinv, k12) # Bishop 6.67
# ---------------------- student exercise --------------------------------- #
return mu, cov
Performing Gaussian process regression#
Now we can perform regression using our GP model. We need to set the hyperparameters, let’s assume we know them for now.
hyperparams["noise"] = 0.2
hyperparams["sig_f"] = 1
hyperparams["length"] = 1
# Compute the posterior mean and covariance
n1 = 8 # Number of points to condition on (observation locations)
n2 = 250 # Number of points in posterior (prediction locations)
n_sample_post = 12 # Number of functions that will be sampled from the posterior
domain_data = [-4, 4]
domain = [-10, 10]
# Sample observations (X1, t) on the function
torch.manual_seed(seed)
X1 = torch.rand((n1, 1)) * (domain_data[1] - domain_data[0]) + domain_data[0]
rng = torch.manual_seed(1)
t = f_data(X1, rng)
# Predict points at uniform spacing to capture function
X2 = torch.linspace(domain[0], domain[1], n2).view(-1, 1)
# Compute posterior mean and covariance
mu_pred, cov_pred = GP(X1, t, X2, cov_func, hyperparams)
# Compute the standard deviation at the test points to be plotted
sig = torch.sqrt(torch.diag(cov_pred))
# Draw some samples of the posterior
y2 = torch.tensor(
np.random.multivariate_normal(
mean=mu_pred.view(-1),
cov=cov_pred - hyperparams["noise"] ** 2 * torch.eye(cov_pred.shape[0]),
size=n_sample_post,
)
)
Given these assumed hyperparameters, we can plot our predictions and the posterior distribution.
# Plot the postior distribution and some samples
def plot_preds(mu_pred, X1, X2, y2, t):
mu_pred = mu_pred.flatten()
X1 = X1.flatten()
X2 = X2.flatten()
t = t.flatten()
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(8, 6))
# Plot the distribution of the function (mean, covariance)
ax1.plot(X2, f_truth(X2), "k--", label="Ground truth")
ax1.fill_between(
X2.flatten(),
mu_pred - 1.96 * sig,
mu_pred + 1.96 * sig,
color="C0",
alpha=0.15,
label="95% conf. interval",
)
ax1.plot(X2, mu_pred, "-", color="C0", lw=2, label="Mean")
ax1.plot(X1, t, "ko", linewidth=2, label="Observations")
ax1.set_xlabel("$x$", fontsize=13)
ax1.set_ylabel("$y$", fontsize=13)
ax1.set_title("Posterior distribution and observation data")
ax1.axis([domain[0], domain[1], -3, 3])
ax1.legend(loc="upper left")
# Plot some samples from this function
ax2.plot(X2, y2.T, "-")
ax2.set_xlabel("$x$", fontsize=13)
ax2.set_ylabel("$y$", fontsize=13)
ax2.set_title(f"{n_sample_post} different function realizations from the posterior")
ax2.axis([domain[0], domain[1], -3, 3])
ax2.set_xlim([domain[0], domain[1]])
plt.tight_layout()
plt.show()
plot_preds(mu_pred, X1, X2, y2, t)
We have fit a GP to our data! The top figure is the most interesting one, spend some time thinking about whether the model predictions make sense.
Do the mean and variance reflect the observations accurately?
What happens to the predictions of the GP far away from the data points?
What do you expect to happen to happen here for the other kernel you implemented?
Try the other more complex
f_truth
function and observe what happens.
Model selection#
Now we turn to model selection. For simplicity we will only focus on using the squared exponential covariance function. We can infer the hyperparameters \(\sigma_f\), \(l\), and the noise directly from the data, without the need for a validation set. We will find the optimal choice for these parameters by minimizing the log marginal likelihood.
Exercise: Loss function#
Recall from the lecture that the log marginal likelihood reads
Implement a function that computes and returns the log marginal likelihood function in the following cell.
# Gaussian process log marginal likelihood
def GP_logmarglike(X, t, kernel, hyperparams):
"""
Calculate the log marginal likelihood based on the observations (X, t) for a given kernel
"""
# Kernel of the observations
k11 = kernel(X, X, **hyperparams)
C = k11 + hyperparams["noise"] ** 2 * torch.eye(k11.shape[0])
# ---------------------- student exercise --------------------------------- #
logmarglike = (
-0.5 * torch.sum(torch.log(torch.diagonal(C)))
- 0.5 * torch.matmul(t.T, torch.matmul(torch.inverse(C), t))
- 0.5 * len(t) * torch.log(torch.tensor(2 * torch.pi))
) # Bishop 6.69
# ---------------------- student exercise --------------------------------- #
return logmarglike
We can now optimize the hyperparameters. After selecting initial values, we perform a gradient-descent based optimization. This code is created for the squared exponential kernel, but it could be adapted for other kernels with relative ease.
# Initial values
sig_init = 1.0
noise_init = 1.0
length_init = 1.0
# Initialize logsigma, lognoise, and loglengthscale.
logsig = torch.log(torch.tensor([sig_init])).requires_grad_(True)
lognoise = torch.log(torch.tensor([noise_init])).requires_grad_(True)
loglen = torch.log(torch.tensor([length_init])).requires_grad_(True)
# We use the bfgs optimizer.
bfgs = torch.optim.LBFGS((logsig, lognoise, loglen), max_iter=50)
sigmas = []
noises = []
lens = []
# This function is called by the optimizer, it cannot have any arguments, therefore we use the global variables.
def update():
bfgs.zero_grad()
hyperparams["sig_f"] = torch.exp(logsig)
hyperparams["noise"] = torch.exp(lognoise)
hyperparams["length"] = torch.exp(loglen)
loss = -GP_logmarglike(X1, t, cov_func, hyperparams)
loss.backward()
sigmas.append(torch.exp(logsig).detach()[0].item())
noises.append(torch.exp(lognoise).detach()[0].item())
lens.append(torch.exp(loglen).detach()[0].item())
return loss
# We perform a single step. In this step, multiple iterations of the optimizer are performed.
bfgs.step(update)
print(
f"BFGS optimal result: sigma_f: {sigmas[-1]}, noise: {noises[-1]}, length scale: {lens[-1]}"
)
The initial values of the hyperparameters can play a big role, the optimization algorithm could get trapped in a local maximum. It is, therefore, best practice to try a couple different initializations. Here we plot what the result looks like for the estimated hyperparameters:
# Compute results
mu_pred, cov_pred = GP(X1, t, X2, cov_func, hyperparams)
# Compute the standard deviation at the test points to be plotted
sig = torch.sqrt(torch.diag(cov_pred))
# Draw some samples of the posterior
y2 = torch.tensor(
np.random.multivariate_normal(
mean=mu_pred.view(-1),
cov=cov_pred.detach() - hyperparams["noise"].detach() ** 2 * torch.eye(cov_pred.shape[0]),
size=n_sample_post,
)
)
# Plot
plot_preds(mu_pred, X1, X2, y2, t)
Review#
Guassian Processes regression is a useful non-parametric method. You’ve seen that we need to invert a matrix of size N x N, where N is the number of data points, to obtain the posterior distribution. This can become prohibitively expensive when dealing with large datasets. Hence, for some problems parametric models, e.g. neural networs, which’s comutational costs scale better with the dataset size are preferred. Still, GPs are a powerful tool — especially in the low data regime — and their inherent probabilistic nature makes them suitable for many engineering applications.
At this point you should have developped some understanding of the concepts behind Gaussian process regression and the central role the kernel function takes therein.