Linear Basis Function Models#

Here we come back to basis function models for regression but now through a Bayesian perspective. To make the discussion easier to follow, we again start at decision theory. We also draw parallels with the frequentist treatment we have seen before.

Decision theory and observation model#

We again have a process \(p(\mathbf{x},t)\) we would like to model with a regression function \(y(\mathbf{x})\). In doing that we incur a loss \(L\) whose expectation is:

(34)#\[ \mathbb{E}[L] = \displaystyle\int\int\left[y(\mathbf{x})-t\right]^2p(\mathbf{x},t)\,\mathrm{d}\mathbf{x}\mathrm{d}t \]

And we have seen in Decision Theory that the best option for \(y(\mathbf{x})\) is:

(35)#\[ y(\mathbf{x}) = \mathbb{E}[t\vert\mathbf{x}] \]

We must therefore come up with an expression for \(p(t\vert\mathbf{x})\). To make our lives easier, we opt to model this with a Gaussian. It is an attractive choice because its expectation is simply its mean.

This leads to the observation model:

(36)#\[ p(t\vert\mathbf{x}) = \mathcal{N}\left(t\vert y(\mathbf{x}),\beta^{-1}\right) \]

with \(\beta^{-1}\) being a variance parameter. Next we specify the shape of \(y(\mathbf{x})\).

Further Reading

For an extended discussion on decision theory for regression, read the short Section 1.5.5.

bishop-prml

Graph model and joint distribution#

To give \(y(\mathbf{x})\) some shape, we again go for a set of basis functions \(\boldsymbol{\phi}(\mathbf{x})\) which we assume are known and fixed. To make the model trainable we assume these basis functions are scaled by a set of weights \(\mathbf{w}\):

(37)#\[ y(\mathbf{x},\mathbf{w}) = \displaystyle\sum_jw_j\phi_j(\mathbf{x}) = \mathbf{w}^\mathrm{T}\boldsymbol{\phi}(\mathbf{x}) \]

We can pick different basis functions for different applications. Click below for a few examples:

../../../../../_images/8fbbb31ad8363ae2c47fafeba7ea60cc37dec78db0eb025dbf9e715b7f14cc5c.png
Functional form

\(\phi_j=x^j\quad\)

Additional parameters

None

Use cases

Functions with global support

../../../../../_images/aa133cff0cdb5955c8fb9d7ba4a3ce08a4070d7407cc00db8ebfc28c1bf0f7d5.png
Functional form

\(\phi_j=\exp\left[-\displaystyle\frac{(x-\mu_j)^2}{2s_j^2}\right]\quad\)

Additional parameters
  • \(\mu_j\): Center of each basis

  • \(s_j\): Length scale (area of influence)

Use cases

Functions with localized support, transitions.

../../../../../_images/1ab615428a25a720da7d9a48f8252c5aeb72fdcec3bb7470ca1495e666f62c97.png
Functional form

\(\phi_j = \displaystyle\frac{1}{1+\exp\left(\frac{\mu_j-x}{s_j}\right)}\quad\)

Additional parameters
  • \(\mu_j\): Location parameter

  • \(s_j\): Length scale (area of influence)

Use cases

Functions whose influence saturate over \(x\)

where we see that they can be nonlinear in \(\mathbf{x}\) while still being linear in \(\mathbf{w}\).

With this we can already define a graph for our model:

../../../../../_images/regressiongraph1.svg

Fig. 10 A probabilistic graph for regression.#

Note that \(\mathbf{x}\) is modeled as deterministic here. With this in mind, the joint distribution for this graph is simply:

(38)#\[ p(\mathbf{w},t\vert\mathbf{x}) = p(\mathbf{w})p(t\vert\mathbf{w},\mathbf{x}) \]

and we already decided to make \(p(t\vert\mathbf{w},\mathbf{x})\) a Gaussian.

But what to do about \(p(\mathbf{w})\)? At this point we can adopt a prior distribution over it. To make matters simple, let us also assume it is Gaussian:

(39)#\[ p(\mathbf{w}) = \mathcal{N}\left(\mathbf{w}\vert\boldsymbol{0},\alpha^{-1}\mathbf{I}\right) \]

where we are assuming a prior with zero mean and diagonal covariance. Let us add that to our graph:

../../../../../_images/regressiongraph2.svg

Fig. 11 A probabilistic graph for regression, now with hyperparameters.#

and we now also add the \(\beta\) parameter from the observation model. Since they govern the shapes of our distributions but not the actual values of \(\mathbf{w}\), we say \(\alpha\) and \(\beta\) are hyperparameters.

Our model is in principle ready to be used. But since the above prior is not very informative, we first need to observe some data to get a better idea of what the weights actually are.

Observing some data#

Suppose we now observe \(N\) samples of \(t\) values and their associated \(\mathbf{x}\) values. We gather these observations in a matrix \(\mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_N\right]\) and a column vector \(\mathbf{t}=\left[t_1,\dots,t_N\right]^\mathrm{T}\).

This changes our graph to:

../../../../../_images/regressiongraph3.svg

Fig. 12 A probabilistic graph for regression, now with observations.#

where the plate with the \(N\) inside means we should imagine these two nodes repeated \(N\) times.

Assuming these \(N\) observations are made independently and identically distributed (i.i.d.) from \(p(t\vert\mathbf{w},\mathbf{x})\), we can get a joint distribution for our whole dataset:

(40)#\[ p(\mathbf{t}\vert\mathbf{X},\mathbf{w},\beta) = \displaystyle\prod_{n=1}^N\mathcal{N}\left(t_n\vert\mathbf{w}^\mathrm{T}\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1}\right) \]

We are now standing at a crossroads. We can either take a shortcut and get a single estimate for \(\mathbf{w}\) or we can go fully Bayesian and compute the posterior with:

(41)#\[ p(\mathbf{w}\vert\mathbf{t}) = \displaystyle\frac {p(\mathbf{t}\vert\mathbf{w})p(\mathbf{w})} {p(\mathbf{t})} \]

Two deterministic shortcuts to \(\mathbf{w}\)#

First we take two possible shortcuts to a quick estimate for the weights. We also discuss why they are not always a good choice.

For the first one we note that Eq. (40) is a multivariate Gaussian for \(\mathbf{t}\) (the product of two or more Gaussians is also a Gaussian). We can therefore go the frequentist way:

Maximum Likelihood Estimation (MLE)

Given a likelihood function \(p(\mathbf{t}\vert\mathbf{w})\), compute \(\mathbf{w}\) that maximizes this likelihood. The prior over \(\mathbf{w}\) is ignored and only a single point estimate \(\mathbf{w}_\mathrm{MLE}\) remains.

Taking the logarithm of Eq. (40) turns the product into a sum and gets rid of the exponentials inside (recall the form of a univariate Gaussian):

(42)#\[ \ln p(\mathbf{t}\vert\mathbf{X},\mathbf{w},\beta) = \displaystyle\frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)-{\color{red}\frac{\beta}{2}\sum_{n=1}^{N}\left[t_n-\mathbf{w}^\mathrm{T}\boldsymbol{\phi}(\mathbf{x}_n)\right]^2} \]

Note the term in red above. Maximizing Eq. (42) is exactly equivalent to minimizing this term. We therefore get the least squares solution from frequentist regression. Recall that this works well when we have a lot of data but tends to overfit to small datasets.

The second shortcut already uses Eq. (41), so it is a better compromise, although it is still a single point estimate for the weights:

Maximum A Posteriori (MAP)

Given a likelihood and a prior, use Bayes’ theorem to compute a posterior \(p(\mathbf{w}\vert\mathbf{t})\) but keep only the most likely value of \(\mathbf{w}\) as estimate. The prior is taken into account but the uncertainty associated with \(\mathbf{w}\) is ignored.

To get a MAP estimate from Eq. (41) it suffices to look at the numerator (the unnormalized posterior). Taking the logarithm and isolating only what depends on \(\mathbf{w}\) we have:

(43)#\[ \ln p(\mathbf{w}\vert\mathbf{t}) \propto \displaystyle -\frac{\beta}{2}\sum_{n=1}^{N}\left[t_n-\mathbf{w}^\mathrm{T}\boldsymbol{\phi}(\mathbf{x}_n)\right]^2 {\color{red}- \frac{\alpha}{2}\mathbf{w}^\mathrm{T}\mathbf{w}} \]

Getting the most likely value for \(\mathbf{w}\) means maximizing the expression above. Now compare this to the expression for regularized least squares you have seen before, especially the term in red. The MAP solution is exactly equivalent to least squares with \(L_2\) regularization!

These results are quite neat, as they unify all the regression approaches we have seen thus far.

Further Reading

Make sure you understand how MLE, MAP and the Bayesian posterior for \(\mathbf{w}\) are related to each other. Reading Section 3.2 and the first two pages of Section 3.3 will give you a different flavor of this same discussion.

bishop-prml

The Bayesian way#

Finally, we can get a proper probability density for \(p(\mathbf{w}\vert\mathbf{t})\) by using the Bayes’ theorem in Eq. (41) consistently. Note that the likelihood \(p(\mathbf{t}\vert\mathbf{w})\) is a conditional Gaussian, \(p(\mathbf{w})\) is a marginal Gaussian, and the joint distribution (the numerator) is also Gaussian. This means we can directly use the standard expressions we derived before to easily derive a posterior:

(44)#\[ p(\mathbf{w}\vert\mathbf{t}) = \mathcal{N}\left(\mathbf{w}\vert\mathbf{m},\mathbf{S}\right) \]

where the mean of the posterior is:

(45)#\[ \mathbf{m} = \beta\mathbf{S}\boldsymbol{\Phi}^\mathrm{T}\mathbf{t} \]

and its covariance:

(46)#\[ \mathbf{S}^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\mathrm{T}\boldsymbol{\Phi} \]

where \(\Phi_{ij} = \phi_j(x_i)\) already appeared in the MLE treatment.

Making new predictions#

We are almost done! We now have a more informed distribution \(p(\mathbf{w}\vert\mathbf{t})\). All that remains now is to finally make new predictions.

Suppose we now have a new input value \(\hat{\mathbf{x}}\) and we would like to predict \(\hat{t}\). We make one final change to our graph:

../../../../../_images/regressiongraph4.svg

Fig. 13 A probabilistic graph for regression, final version.#

The joint distribution for this final graph is:

(47)#\[ p(\hat{t},\mathbf{t},\mathbf{w}\vert\alpha,\beta,\mathbf{X},\hat{\mathbf{x}}) = p(\mathbf{w}\vert\alpha)p(\mathbf{t}\vert\mathbf{w},\mathbf{X},\beta)p(\hat{t}\vert\hat{\mathbf{x}},\mathbf{w},\beta) \]

Conditioning on \(\mathbf{t}\) makes the posterior over \(\mathbf{w}\) appear:

(48)#\[ p(\hat{t},\mathbf{w}\vert\mathbf{t},\alpha,\beta,\hat{\mathbf{x}}) = p(\mathbf{w}\vert\mathbf{t},\alpha,\beta)p(\hat{t}\vert\mathbf{w},\hat{\mathbf{x}},\beta) \]

And finally we can get the distribution over \(\hat{t}\) by marginalizing \(\mathbf{w}\) out:

(49)#\[ p(\hat{t}\vert\hat{\mathbf{x}},\beta) = \displaystyle\int p(\mathbf{w}\vert\mathbf{t},\alpha,\beta)p(\hat{t}\vert\mathbf{w},\hat{\mathbf{x}},\beta) \,\mathrm{d}\mathbf{w} \]

Considering for a moment only \(\mathbf{w}\) and \(\hat{t}\), we now have a joint Gaussian with a marginal \(p(\mathbf{w}\vert\mathbf{t})\) and a conditional \(p(\hat{t}\vert\mathbf{w})\) and we want the marginal \(p(\hat{t})\). Using again the standard expressions, we get:

(50)#\[ p(\hat{t}\vert\hat{\mathbf{x}},\beta) = \displaystyle\mathcal{N}\left( \hat{t}\vert \mathbf{m}^\mathrm{T}\boldsymbol{\phi}(\hat{\mathbf{x}}), \frac{1}{\beta} + \boldsymbol{\phi}(\hat{\mathbf{x}})^\mathrm{T}\mathbf{S}\boldsymbol{\phi}(\hat{\mathbf{x}}) \right) \]

where \(\mathbf{m}\) and \(\mathbf{S}\) come directly from Eqs. (45) and (46).

Further Reading

Read Section 3.3.2 for more details on this predictive distribution. Study what the two terms of the predictive variance represent, and what happens to them in Figure 3.8.

bishop-prml

Click below to see three different radial basis function models fitted to the same dataset using the three strategies we discussed.

../../../../../_images/e06ba5708813efce9c31ce2334e7599cbe4a426e97e756fd6698a5cd5159dcd3.png
Model
  • Unregularized least squares fit

  • No uncertainty information

Prediction quality
  • Overfits noise in data

  • Response explodes in extrapolation

../../../../../_images/623d13039ab6102b4af9fbd6b106f433568374f1ee6dded7c90dab6315f3b380.png
Model
  • Regularized (\(L_2\)) least squares fit

  • No uncertainty information

Prediction quality
  • Regularization prevents overfitting

  • Behaves well away from observations

../../../../../_images/40cba499471d52a55e4356f5f7971653242b8a4951dbb1ffacba2f84858c9773.png
Model
  • Consistent Bayesian fit

  • Provides uncertainty information

Prediction quality
  • Prior distribution prevents overfitting

  • Variance increases away from data