Exercise: Bayesian models versus bootstrapping#

import torch
import matplotlib.pyplot as plt
from cycler import cycler
import seaborn as sns

# Set the color scheme
sns.set_theme()
colors = [
    "#0076C2",
    "#EC6842",
    "#A50034",
    "#009B77",
    "#FFB81C",
    "#E03C31",
    "#6CC24A",
    "#EF60A3",
    "#0C2340",
    "#00B8C8",
    "#6F1D77",
]
plt.rcParams["axes.prop_cycle"] = cycler(color=colors)
torch.set_default_dtype(torch.float64)


# Class that standardizes data to zero mean and unit variance
class normUnitvar:
    def __init__(self, fullDataset):
        self.normmean = fullDataset.mean()
        self.normstd = fullDataset.std()

    def normalize(self, data):
        return (data - self.normmean) / self.normstd

    def denormalize(self, data):
        return data * self.normstd + self.normmean

Introduction#

In this exercise, we will compare two methods that are used for determining the uncertainty of our model parameters and predictions: Bayesian models and bootstrapping. For this comparison, we will return to the linear models with radial basis functions we have seen before. Your task will be to determine confidence intervals for the predictions made by these models using both the Bayesian and the bootstrap approaches.

Before diving in, let’s recap what we have seen before. We have the function f_truth that generates the ground truth. Our data is generated by f_data, which takes the ground truth but corrupts it with i.i.d. Gaussian noise, representing the fact that we do not have perfect measurement devices. Mathematically, this can be expressed as

\[ t = f(\boldsymbol x) + \epsilon \]

where \(\epsilon \sim \mathcal{N}(\boldsymbol 0, \sigma_\epsilon^2 \boldsymbol I)\).

The predict_rbf function then fits a linear model to the data and makes predictions at arbitrary locations. Radial basis functions can be generated using the RadialBasisFunctions function. In the plot below, a linear model consisting of \(8\) radial basis functions with a characteristic length of \(1\) is fitted to the data.

# The true function relating t to x
def f_truth(x):
    # Return a sine
    return torch.sin(x)


# The data is generated from the ground truth with i.i.d. Gaussian noise
def f_data(rng, epsilon, N):
    # Generate N evenly spaced observation locations
    x1 = torch.linspace(0, torch.pi, steps=N // 2)
    x2 = torch.linspace(1.5 * torch.pi, 2 * torch.pi, steps=(N + 1) // 2)
    x = torch.cat((x1, x2))

    # Generate N noisy observations (1 at each location)
    t = f_truth(x) + torch.normal(mean=0, std=epsilon, size=(N,), generator=rng)

    # Return both the locations and the observations
    return x, t


# Here is a function for the radial basis functions:
def RadialBasisFunctions(x, M_radial, l_radial):
    """
    A function that computes radial basis functions.

    Arguments:
    x        -  The datapoints
    M_radial -  The number of basis functions
    l_radial -  The width of each basis function
    """
    mu = torch.linspace(-2, 2, M_radial)
    num_basis = mu.shape[0]

    Phi = torch.zeros((x.shape[0], num_basis))
    for i in range(num_basis):
        Phi[:, i] = torch.exp(-0.5 * (x - mu[i]) ** 2 / l_radial**2)

    return Phi


# Define a function that makes a prediction at the given locations, based on the given (x,t) data
def predict_rbf(x, t, x_pred, M_radial, l_radial):
    # Get the Phi matrix
    Phi = RadialBasisFunctions(x, M_radial, l_radial)

    # Get the coefficient vector (Phi^T Phi)^-1 Phi^T t
    w = torch.linalg.solve(Phi.T @ Phi, Phi.T @ t)

    # Get the prediction Phi matrix
    Phi_pred = RadialBasisFunctions(x_pred, M_radial, l_radial)

    # Make a prediction in the prediction locations
    t_pred = Phi_pred @ w

    # Return the predicted values
    return t_pred


# Use a random number generator for reproducible results
rng = torch.manual_seed(0)

# We have 40 measurements, corrupted by noise with standard deviation 1/5
epsilon = 1.0 / 5.0
N = 40

# Generate the data and get the ground truth at the prediction locations
x, t = f_data(rng, epsilon=epsilon, N=N)
x_pred = torch.linspace(-1, 2 * torch.pi + 1, 1000)
f = f_truth(x_pred)

# Standardize the data
xscaler = normUnitvar(x)
x_sc = xscaler.normalize(x)
x_pred_sc = xscaler.normalize(x_pred)
tscaler = normUnitvar(t)
t_sc = tscaler.normalize(t)

# Use the linear model with radial basis functions to make predictions
t_pred_sc = predict_rbf(x_sc, t_sc, x_pred_sc, M_radial=8, l_radial=1)

# Scale the predictions back to the non-standardized space
t_pred = tscaler.denormalize(t_pred_sc)

# Plot the ground truth, data and predictions
fig, ax = plt.subplots(figsize=(6, 4), tight_layout=True)
ax.scatter(x, t, color="C0", marker="x", label="Training data")
ax.plot(x_pred, t_pred, color="C1", label="Model prediction")
ax.plot(x_pred, f, color="k", label="Ground truth")
ax.set_ylim((-2.5, 2.5))
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$t$")
ax.legend(loc="upper right")
plt.show()

In the code above, play around with the length scale and number of radial basis functions. What happens if l_radial=0.2? What about if l_radial=50? We can intuit qualitatively that some value in between gives us a better model than these extreme cases. However, we currently do not have a way to quantify how confident we can be about the predictions made by our models. This is where Bayesian Linear Regression and Bootstrapping come in.

Bayesian Linear Regression#

Following the Bayesian paradigm, one way of modeling the uncertainty associated with our model is to assign a prior distribution \(p(\boldsymbol w)\) to the coefficients \(\boldsymbol w\) of our linear model. Here, we will follow Section 3.3 of Bishop (Pattern Recognition and Machine Learning) and assume a zero-mean isotropic normal distribution over the weights with a single precision parameter \(\alpha\):

\[ p(\boldsymbol w) = \mathcal{N}(\boldsymbol w | \boldsymbol 0, \alpha^{-1} \boldsymbol I) \]

Recall that we previously assumed that our data comes from a ground truth corrupted by i.i.d. Gaussian noise. As a result, the likelihood function is also given by a Gaussian:

\[ p(\boldsymbol t | \boldsymbol w) = \mathcal{N}(\boldsymbol t | \boldsymbol \Phi \boldsymbol w, \beta^{-1} \boldsymbol I) \]

Since the product of two Gaussian densities is proportional to another Gaussian density (the Gaussian prior is conjugate to the Gaussian likelihood), our posterior is a Gaussian distribution as well:

\[ p(\boldsymbol w | \boldsymbol t) = \mathcal{N}(\boldsymbol w | \boldsymbol m_N, \boldsymbol S_N) \]

where

\[\begin{split} \boldsymbol m_N = \beta \boldsymbol S_N \boldsymbol \Phi^T \boldsymbol t \\ \boldsymbol S_N^{-1} = \alpha \boldsymbol I + \beta \boldsymbol \Phi^T \boldsymbol \Phi \end{split}\]

Combining the posterior distribution with our observation model, i.e. the conditional distribution \(p(\boldsymbol t | \boldsymbol w)\), gives us the predictive distribution \(p(\hat{\boldsymbol t} | \boldsymbol t)\) that we can use to predict \(\hat{\boldsymbol t}\) in a new set of locations \(\hat{\boldsymbol X}\). This distribution is again a Gaussian distribution:

\[ p(\hat{\boldsymbol t} | \boldsymbol t) = \mathcal{N}(\hat{\boldsymbol t} | \boldsymbol m_t, \boldsymbol S_t) \]

where \(\hat{\boldsymbol \Phi}\) is the matrix of basis functions evaluated at the prediction locations \(\hat{\boldsymbol X}\):

\[\begin{split} \boldsymbol m_t = \hat{\boldsymbol \Phi} \boldsymbol m_N \\ \boldsymbol S_t = \beta^{-1} \boldsymbol I + \hat{\boldsymbol \Phi} \boldsymbol S_N \hat{\boldsymbol \Phi}^T \end{split}\]

In the code below, a start has been made on a function that computes the posterior distribution based on the prior and the observed data. Note that the general structure is quite similar to the predict_rbf() function, but we now need to compute a mean vector and covariance matrix for the weights and for the predictions instead of single deterministic values.

Complete the code by adding the computation of the posterior mean vector and covariance matrix of the weights (m_N and S_N) and of the predictions (m_pred and S_pred).

# Define a function that makes a prediction at the given locations, based on the given (x,t) data
def bayesian_predict_rbf(x, t, x_pred, M_radial, l_radial, alpha, beta):
    # Get the Phi matrix
    Phi = RadialBasisFunctions(x, M_radial, l_radial)

    # Get the prior mean and covariance
    # w ~ N(m_0, S_0)
    m_0 = torch.zeros(M_radial)
    S_0 = 1 / alpha * torch.eye(M_radial)
    S_0_inv = alpha * torch.eye(M_radial)

    # Get the posterior mean and covariance of the weights
    # w|t ~ N(m_N, S_N)

    # ---------------------- student exercise --------------------------------- #
    # YOUR CODE HERE
    # ---------------------- student exercise --------------------------------- #

    # Get the prediction Phi matrix
    Phi_pred = RadialBasisFunctions(x_pred, M_radial, l_radial)

    # Get the posterior mean and covariance of the predictions
    # t_hat|t ~ N(m_pred, S_pred)

    # ---------------------- student exercise --------------------------------- #
    # YOUR CODE HERE
    # ---------------------- student exercise --------------------------------- #

    # Return the predicted values
    return m_pred, S_pred
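
For reference, here is one possible way to fill in the two marked blocks: a sketch that transcribes the equations above directly, with variable names matching the skeleton (because m_0 is zero, the simplified expression for m_N applies). An explicit inverse is used to mirror the formulas; torch.linalg.solve would be preferable numerically.

# One possible completion (sketch): posterior over the weights
# S_N^{-1} = alpha I + beta Phi^T Phi,   m_N = beta S_N Phi^T t
S_N_inv = S_0_inv + beta * Phi.T @ Phi
S_N = torch.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

# Posterior predictive at the prediction locations
# m_pred = Phi_pred m_N,   S_pred = beta^{-1} I + Phi_pred S_N Phi_pred^T
m_pred = Phi_pred @ m_N
S_pred = torch.eye(Phi_pred.shape[0]) / beta + Phi_pred @ S_N @ Phi_pred.T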

In the code below, we perform regression with the Bayesian linear model with radial basis functions. The four main parameters that we can play around with are:

  • Prior precision \(\alpha\) determines how peaked the prior distribution is. The higher \(\alpha\) is, the more peaked the distribution is (see the prior-sampling sketch after this list).

  • Observation precision \(\beta\) determines our measurements’ precision. Since, in this case, we know that our data was generated with a standard deviation equal to \(\sigma_\epsilon\), we can set the precision to \(\beta = \sigma_\epsilon^{-2}\). You can think of this as us knowing beforehand how accurate our measurement device is.

  • The number of radial basis functions that we are fitting.

  • The length scale of the radial basis functions.
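
To build intuition for \(\alpha\) before fitting anything, the short sketch below (not part of the exercise) draws weight vectors from the prior \(\boldsymbol w \sim \mathcal{N}(\boldsymbol 0, \alpha^{-1} \boldsymbol I)\) and plots the functions they induce in standardized space; increasing alpha_demo pulls the sampled curves toward zero.

# Sketch: sample functions from the prior w ~ N(0, alpha^{-1} I)
alpha_demo = 1.0
Phi_demo = RadialBasisFunctions(x_pred_sc, M_radial=8, l_radial=1)
fig, ax = plt.subplots(figsize=(6, 4), tight_layout=True)
for _ in range(10):
    w_sample = torch.randn(8, generator=rng) / alpha_demo**0.5
    ax.plot(x_pred, Phi_demo @ w_sample, color="C0", alpha=0.5)
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$t$ (standardized)")
plt.show()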

# Set the prior precision, observation precision, and basis function parameters
alpha = 1
beta = 1 / epsilon**2
M_radial = 8
l_radial = 1

# Get the posterior mean and covariance of the prediction locations
m_pred_sc, S_pred_sc = bayesian_predict_rbf(
    x_sc, t_sc, x_pred_sc, M_radial=M_radial, l_radial=l_radial, alpha=alpha, beta=beta
)
m_pred = tscaler.denormalize(m_pred_sc)
S_pred = tscaler.normstd**2 * S_pred_sc
sigma_pred = torch.sqrt(S_pred.diagonal())

# Plot the ground truth, data and predictions
fig, ax = plt.subplots(figsize=(6, 4), tight_layout=True)
ax.scatter(x, t, color="C0", marker="x", label="Training data")
ax.plot(x_pred, m_pred, color="C1", label="Model prediction")
ax.fill_between(
    x_pred,
    m_pred - 1.96 * sigma_pred,
    m_pred + 1.96 * sigma_pred,
    color="C1",
    alpha=0.3,
)
ax.plot(x_pred, f, color="k", label="Ground truth")
ax.set_ylim((-2.5, 2.5))
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$t$")
ax.legend(loc="upper right")
plt.show()

In the figure above, you can see the fitted Bayesian linear model with radial basis functions. You can play around with the values of \(\alpha\), \(\beta\), and the length scale and number of radial basis functions. Try to answer the following questions:

  • What happens if you set the prior precision \(\alpha\) to a very large value? Why?

  • What happens with the posterior mean if you set the prior precision \(\alpha\) to a very low value? Why?

  • If you set l_radial=0.2 to trigger overfitting, what happens to the posterior standard deviation in the regions where the model is being overfitted?

  • Conversely, if you set l_radial=50 to make the model underfit, what happens to the posterior standard deviation?

  • How does the assumed observation precision \(\beta\) affect the posterior distribution?

Some questions remain unanswered, for example: how do we choose the best values of \(\alpha\), \(\beta\), and the length scale and number of radial basis functions? How can our Bayesian model detect underfitting? These questions will be addressed in the next lecture and exercise notebook.

The Bootstrap#

Now, we consider a frequentist approach to quantifying model uncertainty: the bootstrap. Instead of fitting a single model to the dataset, we generate multiple datasets from our original set and fit a model to each. We can then look at the variability between these models to estimate the model uncertainty. The bootstrapped datasets are created by drawing \(N\) samples from our original dataset with replacement, which means that a single point from the original dataset can appear multiple times in a bootstrapped dataset.

Complete the code below to generate a bootstrapped dataset from the original dataset. You can use the plot in the cell that follows to verify your implementation.

Hint: you can use the torch.randint() function to generate random integers in a particular range.

def single_bootstrap(x, t, rng):
    # ---------------------- student exercise --------------------------------- #
    # YOUR CODE HERE
    # ---------------------- student exercise --------------------------------- #

    return x_bootstrap, t_bootstrap
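
For reference, one possible implementation of the sampling step is sketched below as a separate function (the name single_bootstrap_reference is ours, just for illustration); it uses torch.randint(), as the hint suggests.

# One possible implementation (sketch): draw N indices with replacement
def single_bootstrap_reference(x, t, rng):
    N = x.shape[0]
    idx = torch.randint(low=0, high=N, size=(N,), generator=rng)
    return x[idx], t[idx]
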
x_b, t_b = single_bootstrap(x, t, rng)

fig, ax = plt.subplots(figsize=(6, 4), tight_layout=True)
ax.scatter(x, t, color="C0", marker="X", label="Original dataset")
ax.scatter(x_b, t_b, color="C1", marker="x", label="Bootstrapped dataset")
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$t$")
ax.legend(loc="upper right")
plt.show()

Now, we generate 50 bootstrapped datasets and fit the frequentist linear model with radial basis functions to each bootstrapped dataset. We can then compute a mean and standard deviation over these models to estimate how confident we can be about our model predictions.

rng = torch.manual_seed(0)

L = 50

T_pred = torch.zeros((len(x_pred), L))

for i in range(L):
    x_b, t_b = single_bootstrap(x, t, rng)

    # Standardize the bootstrapped datasets
    x_b_sc = xscaler.normalize(x_b)
    t_b_sc = tscaler.normalize(t_b)

    # Fit the linear model with radial basis functions to the bootstrapped data
    t_pred_sc = predict_rbf(x_b_sc, t_b_sc, x_pred_sc, M_radial=8, l_radial=1)

    # Scale the predictions back to the non-standardized space
    t_pred = tscaler.denormalize(t_pred_sc)
    T_pred[:, i] = t_pred

# Get the mean and standard deviation over the models
m_pred = torch.mean(T_pred, dim=1)
sigma_pred = torch.std(T_pred, dim=1)


# Plot the ground truth, data and predictions
fig, ax = plt.subplots(figsize=(6, 4), tight_layout=True)
ax.scatter(x, t, color="C0", marker="x", label="Training data")
ax.plot(x_pred, m_pred, color="C1", label="Average model prediction")
ax.fill_between(
    x_pred,
    m_pred - 1.96 * sigma_pred,
    m_pred + 1.96 * sigma_pred,
    color="C1",
    alpha=0.3,
)
for i in range(L):
    t_pred = T_pred[:, i]
    ax.plot(x_pred, t_pred, color="C1", linewidth=0.5, alpha=0.3)
ax.plot(x_pred, f, color="k", label="Ground truth")
ax.set_ylim((-2.5, 2.5))
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$t$")
ax.legend(loc="upper right")
plt.show()

The figure above shows the \(50\) bootstrapped models, their average prediction, and the corresponding \(95\%\) confidence interval. How does this compare to the Bayesian approach?

Try to answer the following questions:

  • If you again set l_radial=0.2 and l_radial=50 to make the model overfit and underfit, how does this affect the bootstrapped model?

  • How does the bootstrapped model compare to the Bayesian model from before? What are the similarities? What are the differences?

  • How does the number of bootstrapped datasets \(L\) affect the results? Is there an optimal value of \(L\)? (A small sketch for exploring this empirically is given below.)
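
To explore the last question empirically, the sketch below (reusing single_bootstrap, predict_rbf, and the scalers defined above) refits the ensemble for several values of \(L\) and reports how the average predictive standard deviation behaves; the chosen values of L_test are arbitrary.

# Sketch: track the average predictive standard deviation as L grows
for L_test in (5, 20, 50, 200):
    T_test = torch.zeros((len(x_pred), L_test))
    for i in range(L_test):
        x_b, t_b = single_bootstrap(x, t, rng)
        t_pred_sc = predict_rbf(
            xscaler.normalize(x_b), tscaler.normalize(t_b), x_pred_sc, M_radial=8, l_radial=1
        )
        T_test[:, i] = tscaler.denormalize(t_pred_sc)
    print(f"L = {L_test:3d}: mean sigma_pred = {T_test.std(dim=1).mean().item():.4f}")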