Linear Basis Function Models#
# pip install packages that are not in Pyodide
%pip install ipympl==0.9.3
%pip install seaborn==0.12.2
# Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from mude_tools import magicplotter
from cycler import cycler
import seaborn as sns
%matplotlib widget
# Set the color scheme
sns.set_theme()
colors = [
"#0076C2",
"#EC6842",
"#A50034",
"#009B77",
"#FFB81C",
"#E03C31",
"#6CC24A",
"#EF60A3",
"#0C2340",
"#00B8C8",
"#6F1D77",
]
plt.rcParams["axes.prop_cycle"] = cycler(color=colors)
Introduction#
So far, we have been using a non-parametric model with k-nearest neighbors, meaning we needed access to the whole training dataset for each prediction. We will now focus on parametric models, namely linear models with basis functions. Parametric models are defined by a finite set of parameters calibrated in a training step. All we need for a prediction then are the parameter values. There is no longer a need to carry the whole dataset with us; the information used to make predictions is encoded in the model parameters. Once again, we will employ the simple sine function to demonstrate the concepts presented in this page.
# The true function relating t to x
def f_truth(x, freq=1, **kwargs):
    # Return a sine with the given frequency
    return np.sin(x * freq)
# The data generation function
def f_data(epsilon=0.7, N=100, **kwargs):
# Apply a seed if one is given
if "seed" in kwargs:
np.random.seed(kwargs["seed"])
# Get the minimum and maximum
xmin = kwargs.get("xmin", 0)
xmax = kwargs.get("xmax", 2 * np.pi)
# Generate N evenly spaced observation locations
x = np.linspace(xmin, xmax, N)
# Generate N noisy observations (1 at each location)
t = f_truth(x, **kwargs) + np.random.normal(0, epsilon, N)
# Return both the locations and the observations
return x, t
# Get the observed data
x, t = f_data()
# Plot the data and the ground truth
fig, ax = plt.subplots(figsize=(8, 4.5))
fig.canvas.toolbar_visible = False
ax.set_position([0.2, 0.1, 0.7, 0.8])
plt.plot(x, f_truth(x), "k-", label=r"Ground truth $f(x)$")
plt.plot(x, t, "x", label=r"Noisy data $(x,t)$")
plt.xlabel("x")
plt.ylabel("t")
plt.legend()
plt.show()
Linear model#
The key idea behind linear regression models is that they are linear in their parameters \(\mathbf{w}\). They might be linear in their inputs as well, although this does not necessarily need to be the case, as we will see later on in this page. The simplest approach is to model our target function \(y(x)\) as a linear combination of the coordinates \(x\):

\[
y(x, \mathbf{w}) = w_0 + w_1 x.
\]
In the one-dimensional case, this is equivalent to fitting a straight line through our datapoints. The parameter \(w_0\), also referred to as the bias (not to be confused with the model bias from the previous page), determines the intercept, and \(w_1\) determines the slope. The introduction of a dummy input \(x_0 = 1\) allows us to write the model in a more concise way:

\[
y(\mathbf{x}, \mathbf{w}) = w_0 x_0 + w_1 x_1 = \mathbf{w}^T \mathbf{x}, \qquad \mathbf{x} = [1, x]^T.
\]
We will use the least-squares error function from the previous page to fit our model, but we will first show how this choice is motivated by a maximum likelihood approach.
Maximum Likelihood Estimation#
Oftentimes, it is assumed that the target \(t\) is given by a deterministic function \(y(\mathbf{x}, \mathbf{w})\) with additive Gaussian noise, so that

\[
t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \beta^{-1}),
\]

with precision \(\beta\) (which is defined as \(1/\sigma^2\)). Note that this assumption is only justified in the case of a unimodal conditional distribution for \(t\). It can have a strong influence on the model accuracy, and its validity must therefore be assessed carefully.
As seen in the previous page, for the square loss function, the optimal prediction for some value \(\mathbf{x}\) is given by the conditional mean of the target variable \(t\). In this case, this conditional mean is given by:

\[
\mathbb{E}[t | \mathbf{x}] = \int t \, p(t | \mathbf{x}) \, \mathrm{d}t = y(\mathbf{x}, \mathbf{w}).
\]
Given a new input \(\mathbf{x}\), the distribution of \(t\) that follows from our model is

\[
p(t | \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\left(t \,|\, y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right).
\]
Consider now a dataset \(\mathcal{D}\) consisting of inputs \(\mathbf{X} = \{ \mathbf{x}_1, \dots, \mathbf{x}_N \}\) and targets \(\mathbf{t} = \{ t_1, \dots, t_N \}\). Assuming our datapoints are drawn independently from the same distribution (i.i.d. assumption), the likelihood of drawing this dataset from our model is

\[
p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \,|\, y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}\right),
\]

also referred to as the likelihood function. Taking the logarithm and expanding the classic expression for a multivariate Gaussian distribution gives:

\[
\ln p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2 \pi) - \frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - y(\mathbf{x}_n, \mathbf{w}) \right)^2,
\]

where we can identify our square-error loss function in the last term. Note that the first two terms are constant for a given dataset and have no influence on the parameter setting \(\bar{\mathbf{w}}\) that maximizes the likelihood. These optimal parameter values \(\bar{\mathbf{w}}\) can be obtained by setting the gradient of our loss function w.r.t. \(\mathbf{w}\) to zero and solving for \(\mathbf{w}\).
It is convenient to concatenate all inputs into a design matrix \(\mathbf{X} = [\mathbf{x}_1^T, \dots, \mathbf{x}_N^T]^T\). Solving for \(\mathbf{w}\) gives

\[
\bar{\mathbf{w}} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{t},
\]
which is the classical expression for a least-squares solution you have by now seen many times during the course.
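To make this result concrete, here is a minimal sketch (our own addition, reusing the f_data generator defined above) that computes the maximum likelihood weights of the straight-line model via the normal equations, cross-checks them against numpy's least-squares routine, and estimates the noise precision \(\beta\) from the residuals:

# Minimal sketch: maximum likelihood weights for the straight-line model,
# obtained from the normal equations and cross-checked with np.linalg.lstsq
x_lin, t_lin = f_data(N=100, seed=0)
X_lin = np.column_stack((np.ones_like(x_lin), x_lin))  # design matrix with dummy input x_0 = 1
w_ml = np.linalg.solve(X_lin.T @ X_lin, X_lin.T @ t_lin)
w_ref, *_ = np.linalg.lstsq(X_lin, t_lin, rcond=None)
print(w_ml, w_ref)  # the two solutions should agree up to round-off
# The ML estimate of the precision follows from the mean squared residual
beta_ml = 1.0 / np.mean((t_lin - X_lin @ w_ml) ** 2)
print("estimated precision beta:", beta_ml)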
Data normalization#
Before we move on to fitting the model, we normalize our data. This step is recommended for most machine learning techniques and is often even necessary: a non-normalized dataset almost always leads to a numerically more challenging optimization problem. Part of model selection is to ensure that our basis functions show the desired behavior in the relevant parts of the domain. We center and rescale our data, a process referred to as standardization, so that we operate in the vicinity of the origin only. We use the StandardScaler class from the sklearn.preprocessing library to carry out the standardization. The standardized dataset \(\hat{\mathcal{D}} = ( \hat{x}, \hat{t} )\) is obtained by subtracting the sample mean \(\mu\) and dividing by the sample standard deviation \(\sigma\) of the data:

\[
\hat{x} = \frac{x - \mu_x}{\sigma_x}, \qquad \hat{t} = \frac{t - \mu_t}{\sigma_t}.
\]
Take a look below at the standardized and unstandardized data. Note that the standardization of the target \(t\) has only a marginal effect, as the sine function is already centered at 0 and has a standard deviation close to 1. A 4-by-4 square has been added to indicate the region from \(-2 \hat{\sigma}\) to \(2 \hat{\sigma}\). As you can see, all input and output variables fall roughly in this interval, and this property allows for more stability when applying numerical solvers to the problem.
Note that there is no strictly correct way to shift and scale input data. Depending on the distribution of the data, a min-max scaling or a quantile scaling might lead to a better numerical setup. The dataset’s structure needs to be assessed carefully to make an informed decision on normalization.
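As a quick sanity check (our own sketch, not part of the original workflow), the StandardScaler transform should reproduce the formula above when it is applied by hand:

# Sketch: verify that StandardScaler matches (x - mu) / sigma computed by hand
x_chk, _ = f_data(N=100, seed=0)
x_scaled = StandardScaler().fit_transform(x_chk[:, None]).ravel()
x_manual = (x_chk - x_chk.mean()) / x_chk.std()  # sample mean and (biased) standard deviation
print(np.allclose(x_scaled, x_manual))  # expected: True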
# Generate data, instantiate the scalers, and fit-transform
np.random.seed(0)
x, t = f_data(N=100)
xscaler, tscaler = StandardScaler(), StandardScaler()
x_norm, t_norm = xscaler.fit_transform(x[:, None]), tscaler.fit_transform(t[:, None])
# plot
fig, ax = plt.subplots(figsize=(8, 4.5))
fig.canvas.toolbar_visible = False
ax.set_position([0.2, 0.1, 0.7, 0.8])
ax.plot(x, t, "x", label="unnormalized data")
ax.plot(x_norm, t_norm, "x", label="data after normalization")
# Create a Rectangle patch
rect = patches.Rectangle((-2, -2), 4, 4, linewidth=1.0, edgecolor="k", facecolor="none")
# Add the patch to the Axes
ax.add_patch(rect)
ax.set_aspect("equal", "datalim")
plt.legend(loc="upper right")
plt.show()
With that out of the way, let us now define a few tools we need to state, solve, and visualize our problem.
# Function for the linear basis functions
def LinearBasis(x, **kwargs):
    """
    Represents a 1D linear basis: two basis functions, the dummy input
    phi_0(x) = 1 and the identity phi_1(x) = x.
    """
    x = x.reshape(-1, 1)
    return np.hstack((np.ones_like(x), x))
# Let's test our implementation
print("Design matrix X given by:\n\n", LinearBasis(np.arange(0, 5)))
# Define a function that makes a prediction at the given locations x_pred,
# based on the given (x, t) data. Note that the prediction locations do not
# need to coincide with the locations where we observed our data.
def predict(x, t, x_pred, basis, normalize=True, **kwargs):
    # Reshape to column vectors if necessary for the scalers
    x = x[:, None] if len(x.shape) == 1 else x
    t = t[:, None] if len(t.shape) == 1 else t
    x_pred = x_pred[:, None] if len(x_pred.shape) == 1 else x_pred
    # Normalize the data (you will see why this matters further below)
    xscaler, tscaler = StandardScaler(), StandardScaler()
    if normalize:
        x_sc, t_sc = xscaler.fit_transform(x), tscaler.fit_transform(t)
        x_pred_sc = xscaler.transform(x_pred)
    else:
        x_sc, t_sc = x, t
        x_pred_sc = x_pred
    # Assemble the design matrix from the basis functions
    Phi = basis(x_sc.reshape(-1), **kwargs)
    t_sc = t_sc.reshape(-1)
    # Get the coefficient vector by solving the normal equations
    w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t_sc)
    # Make a prediction at the prediction locations
    Phi_pred = basis(x_pred_sc.reshape(-1), **kwargs)
    t_pred = Phi_pred @ w
    # Map the prediction back to the original (unnormalized) scale
    if normalize:
        return tscaler.inverse_transform(t_pred[:, None]).reshape(-1)
    else:
        return t_pred.reshape(-1)
Note that we are not inverting \(\mathbf{X}^T \mathbf{X}\), which is extremely expensive for large amounts of data. A more efficient way to obtain \(\mathbf{w}\) is to solve \(\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{t}\). It is important to note that this system can only be solved if \(\mathbf{X}\) has full column rank.
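As an illustration (our own sketch, using the LinearBasis function and data generator defined above), both routes give the same weights here, but the solve-based route avoids forming an explicit inverse, which is more expensive and numerically less stable:

# Sketch: solving the normal equations vs. forming the explicit inverse
x_chk2, t_chk2 = f_data(N=50, seed=1)
Phi_chk = LinearBasis(x_chk2)
print("full column rank:", np.linalg.matrix_rank(Phi_chk) == Phi_chk.shape[1])
w_solve = np.linalg.solve(Phi_chk.T @ Phi_chk, Phi_chk.T @ t_chk2)
w_inv = np.linalg.inv(Phi_chk.T @ Phi_chk) @ Phi_chk.T @ t_chk2
print(np.allclose(w_solve, w_inv))  # expected: True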
# Let's run our model with linear basis functions and plot the results
x_pred = np.linspace(-1, 2 * np.pi + 1, 1000)
plot = magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=LinearBasis,
pred_label="Prediction $y(x)$",
height=4.5,
)
plot.fig.canvas.toolbar_visible = False
plot.show()
It is clear from the plot that a linear model with linear features lacks the flexibility to fit the data well. A bias-variance decomposition analysis for this model would show that it has little variance but a strong bias. We now consider nonlinear functions of the input \(x\) as features/regressors to increase the flexibility of our linear model. A common approach is to use a set of polynomial basis functions,

\[
\phi_j(x) = x^j,
\]

but numerous other choices are possible. The full formulation for a model with \(M\) polynomial basis functions is thus

\[
y(x, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_{M-1} x^{M-1},
\]

which shows how the model is still linear w.r.t. \(\mathbf{w}\), even though it is no longer linear in the inputs. The design matrix for this more general case reads

\[
\boldsymbol{\Phi} =
\begin{bmatrix}
\phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\
\phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N)
\end{bmatrix}.
\]

As was the case for \(\mathbf{X}\), we need to ensure that \(\boldsymbol{\Phi}\) has full column rank. This is not always the case, for example when we have more basis functions than data points, or when our basis functions are not linearly independent.
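For instance (a small sketch of our own, using numpy's np.vander as a stand-in polynomial design matrix), with more basis functions (columns) than data points (rows) the design matrix cannot have full column rank:

# Sketch: more basis functions than data points gives a rank-deficient design matrix
x_few = np.linspace(-1, 1, 5)                       # only 5 data points
Phi_poly = np.vander(x_few, N=8, increasing=True)   # 8 polynomial basis functions x^0, ..., x^7
print(Phi_poly.shape)                               # (5, 8)
print(np.linalg.matrix_rank(Phi_poly))              # at most 5, i.e. less than 8 columns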
Let’s implement a PolynomialBasis function.
# Here is a function for the polynomial basis functions:
def PolynomialBasis(x, degree, **kwargs):
"""
A function that computes polynomial basis functions.
Arguments:
x - The datapoints
degree - The degree of the polynomial
"""
return np.array([x**i for i in range(degree + 1)]).transpose()
# Let's test our implementation and visualize the polynomial basis functions
degree = 5
x_test = np.linspace(-1, 1, 100)
Phi_p_test = PolynomialBasis(x_test, degree=degree)[:, 1::]
# Plot the data and the ground truth
fig, ax = plt.subplots(figsize=(8, 4.5))
fig.canvas.toolbar_visible = False
ax.set_position([0.2, 0.1, 0.7, 0.8])
for i, row in enumerate(Phi_p_test.transpose()):
plt.plot(x_test, row, label=r"$\phi_{}(x)$".format(i + 1))
plt.xlabel(r"$x$")
plt.ylabel(r"$\phi(x)$")
plt.legend()
plt.show()
We obtain the linear model with nonlinear basis functions by replacing the coordinate vector \(\mathbf{x}\) with the feature vector \(\boldsymbol{\phi}(x)\):

\[
y(x, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(x).
\]

The solution procedure remains the same, and we can solve for \(\bar{\mathbf{w}}\) directly:

\[
\bar{\mathbf{w}} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{t}.
\]
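As a small worked example (our own sketch, reusing the PolynomialBasis helper and the standardized data x_norm, t_norm from above), the weights of a cubic model follow directly from this expression:

# Sketch: weights of a cubic polynomial model from the normal equations
Phi_cubic = PolynomialBasis(x_norm.ravel(), degree=3)  # N x 4 design matrix
w_bar = np.linalg.solve(Phi_cubic.T @ Phi_cubic, Phi_cubic.T @ t_norm.ravel())
print("fitted weights:", w_bar)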
Let’s take a look at the linear model with polynomial regressors.
# Plot the resulting predictions
fig, ax = plt.subplots(2, 2, figsize=(9, 6), sharex="all", sharey="all")
fig.canvas.toolbar_visible = False
plt.suptitle(r"generalized linear regression for polynomials of degree $p$")
# Plot for degree=2
magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=PolynomialBasis,
degree=2,
ax=ax[0][0],
hide_legend=True,
pred_label=r"Prediction $y(x)$",
title=r"$degree={degree}$",
)
# Plot for degree=5
magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=PolynomialBasis,
degree=5,
ax=ax[0][1],
hide_legend=True,
title=r"$degree={degree}$",
)
# Plot for degree=10
magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=PolynomialBasis,
degree=10,
ax=ax[1][0],
hide_legend=True,
title=r"$degree={degree}$",
)
# Plot for degree=25
magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=PolynomialBasis,
degree=25,
ax=ax[1][1],
hide_legend=True,
title=r"$degree={degree}$",
)
# Add a general legend at the bottom of the plot
plt.subplots_adjust(bottom=0.2)
handles, labels = ax[0][0].get_legend_handles_labels()
fig.legend(handles, labels, loc="lower center")
plt.show()
That is looking much better already. However, the quality of the fit varies significantly with the degree of the polynomial basis. There seems to be an ideal model complexity for this specific problem. Try out the interactive tool below to get an idea of the interplay of the following variables:
\(p\), the degree of the polynomial basis
\(N\), the size of the training data set
\(freq\), the frequency of the underlying truth
\(\varepsilon\), the level of noise associated with the data
The seed can be updated to generate new random data sets
The truth can be hidden to simulate a situation that is closer to a practical setting
plot1 = magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=PolynomialBasis,
pred_label=r"Prediction $y(x)$, $p={degree}$",
)
plot1.fig.canvas.toolbar_visible = False
plot1.add_sliders("epsilon", "degree", "N", "freq")
plot1.add_buttons("truth", "seed", "reset")
plot1.show()
A few questions that might have crossed your mind while playing with the tool:
With a small amount of data (\(N \leq 11\)), what happens if we have as many data points as parameters? \((p + 1 = N)\)
With a small amount of data (\(N \leq 11\)), what happens if we have more model parameters than data? \((p + 1 > N)\)
We only have access to data in the interval \([0,2\pi]\). How well does our model extrapolate beyond the data range?
Other choices of basis functions#
As mentioned previously, the polynomial basis is just one choice among many to define our model. Depending on the problem setting, a different set of basis functions might lead to better results. Another popular choice is the set of radial basis functions (also called Gaussian basis functions), given by

\[
\phi_j(x) = \exp \left( - \frac{(x - \mu_j)^2}{2 l^2} \right), \qquad j = 1, \dots, M,
\]

where \(\phi_j\) is centered around \(\mu_j\), \(l\) determines the width, and \(M\) refers to the number of basis functions. Let’s implement a RadialBasisFunctions function:
# Here is a function for the radial basis functions:
def RadialBasisFunctions(x, M_radial, l_radial, **kwargs):
    """
    A function that computes radial (Gaussian) basis functions.
    Arguments:
    x - The datapoints
    M_radial - The number of basis functions
    l_radial - The width of each basis function
    """
    # Place the centers evenly over [-2, 2], i.e. roughly the +/- 2 sigma
    # range of the standardized inputs
    mu = np.linspace(-2, 2, M_radial)
    num_basis = mu.shape[0]
    Phi = np.ndarray((x.shape[0], num_basis))
    for i in range(num_basis):
        Phi[:, i] = np.exp(-0.5 * (x - mu[i]) ** 2 / l_radial**2)
    return Phi
# Let's test our implementation
l_radial = 0.5
M_radial = 9
x_test = np.linspace(-2, 2, 200)
Phi_radial_test = RadialBasisFunctions(x_test, M_radial=M_radial, l_radial=l_radial)
# Plot the data and the ground truth
fig, ax = plt.subplots(figsize=(8, 4.5))
fig.canvas.toolbar_visible = False
ax.set_position([0.2, 0.1, 0.7, 0.8])
for i, row in enumerate(Phi_radial_test.transpose()):
plt.plot(x_test, row, label=r"$\phi_{}(x)$".format(i + 1))
plt.xlabel(r"$x$")
plt.ylabel(r"$\phi(x)$")
plt.legend()
plt.show()
One of the key attributes of this model is the locality of its individual basis functions: data in one part of the domain has little impact on predictions in distant parts of the domain. Periodicity can be achieved with a Fourier basis. Wavelets are popular in signal processing since they are localized in both frequency and space. It is up to the user to determine which basis function properties are desired for a given problem, and this is an important part of model selection. Try to implement some of these basis functions yourself (a sketch of a Fourier basis is given below) and assess how well they compare with the pre-implemented ones.
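As a starting point, here is a possible sketch of a Fourier-type basis that follows the same interface as PolynomialBasis and RadialBasisFunctions above; the name FourierBasis and the choice of integer frequencies are our own illustrative assumptions:

# Sketch of a possible Fourier basis, compatible with the predict function above
def FourierBasis(x, M_fourier, **kwargs):
    """
    Computes a truncated Fourier basis: a constant term plus sine/cosine
    pairs with increasing integer frequencies.
    Arguments:
    x - The datapoints
    M_fourier - The number of sine/cosine pairs
    """
    columns = [np.ones_like(x)]
    for k in range(1, M_fourier + 1):
        columns.append(np.sin(k * x))
        columns.append(np.cos(k * x))
    return np.stack(columns, axis=1)

# It can be passed to the existing predict function just like the other bases, e.g.
# t_pred = predict(x, t, x_pred, basis=FourierBasis, M_fourier=3)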
Let’s see how well the linear model with radial basis functions performs on the sine wave problem. Keep in mind that the lengthscale parameter corresponds to the lengthscale in the standardized space.
# Plot the resulting predictions
fig, ax = plt.subplots(2, 2, figsize=(9, 6), sharex="all", sharey="all")
fig.canvas.toolbar_visible = False
plt.suptitle(
r"generalized linear regression for radial basis functions with varying $M$ and $l$"
)
# Plot for l=0.5, M=5
magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=RadialBasisFunctions,
M_radial=5,
l_radial=0.5,
ax=ax[0][0],
hide_legend=True,
pred_label=r"Prediction $y(x)$",
title=r"$l = {l_radial}, M = {M_radial}$",
)
# Plot for l=0.5, M=15
magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=RadialBasisFunctions,
M_radial=15,
l_radial=0.5,
ax=ax[0][1],
hide_legend=True,
title=r"$l = {l_radial}, M = {M_radial}$",
)
# Plot for l=1.5, M=5
magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=RadialBasisFunctions,
M_radial=5,
l_radial=1.5,
ax=ax[1][0],
hide_legend=True,
title=r"$l = {l_radial}, M = {M_radial}$",
)
# Plot for l=1.5, M=15
magicplotter(
f_data,
f_truth,
predict,
x_pred,
basis=RadialBasisFunctions,
M_radial=15,
l_radial=1.5,
ax=ax[1][1],
hide_legend=True,
title=r"$l = {l_radial}, M = {M_radial}$",
)
# Add a general legend at the bottom of the plot
plt.subplots_adjust(bottom=0.2)
handles, labels = ax[0][0].get_legend_handles_labels()
fig.legend(handles, labels, loc="lower center")
plt.show()
The figure above shows four different combinations of the hyperparameters (number of basis functions and length scale). The quality of the fit depends strongly on the parameter setting, but a visual inspection indicates our model can replicate the general trend.
Think about the following questions:
Do you notice any major differences in the plots?
Do you think normalization improves model fitting in this particular case? Compare with the polynomial basis.
If so, why does this happen?
Final remarks#
This page introduced generalized linear models with arbitrarily high flexibility. We have seen that increased flexibility is not always good if we perform a simple least-squares analysis. We know from the previous page that we can introduce a validation set to prevent our model from overfitting; however, removing features is not always trivial. The following page will introduce you to ridge regression, an elegant method for controlling the model complexity.