Gaussian Processes#
On the previous page we found a way to convert a model based on weights \(\bw\) into an exactly equivalent model over functions, represented by a multivariate Gaussian distribution whose covariance matrix is built from a kernel function. That model is an example of a Gaussian Process, which we now define formally:
Gaussian Process (GP)
A Gaussian probability density over functions. In practice, this means that the function values at any finite set of points are jointly Gaussian.
Just like any Gaussian, a GP can be fully described by its mean and covariance:

$$
y(\bx) \sim \mathcal{GP}\left(m(\bx),\, k(\bx,\bx')\right)
$$
where \(m(\bx)\) is a mean function and \(k(\bx,\bx')\) is a kernel that defines the covariance between any two points \(\bx\) and \(\bx'\).
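As a minimal illustration (our own numpy sketch, not part of the original formulation, using the squared exponential kernel discussed later on this page), the mean function and the kernel are all we need to write down the finite-dimensional Gaussian over the function values at any chosen set of inputs:

```python
import numpy as np

def mean_fn(x):
    # mean function m(x); zero here, as in the priors used on this page
    return np.zeros_like(x)

def sq_exp_kernel(xa, xb, sigma_f=1.0, ell=1.0):
    # one possible kernel k(x, x'): the squared exponential (see below)
    d = xa[:, None] - xb[None, :]          # pairwise differences
    return sigma_f**2 * np.exp(-d**2 / (2 * ell**2))

# Any finite set of inputs defines an ordinary multivariate Gaussian:
x = np.linspace(0.0, 1.0, 5)
m = mean_fn(x)                 # mean vector, shape (5,)
K = sq_exp_kernel(x, x)        # covariance matrix, shape (5, 5)
print(m.shape, K.shape)
```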
We can represent any number of values of \(y(\bx)\) as a probabilistic graph, as we have been doing. Because every value of \(y\) is correlated with every other value (through the covariance \(\mbf{K}\)), we can draw a few graphs as below:
The joint densities for each of the three graphs above would be, respectively:
Looking at the equations above, the pattern should become quite clear. Due to the simple marginalization result for Gaussian densities, any number of variables in the process can be described on their own or in blocks. Adding a new variable to the mix simply means adding the colored submatrices you see above. Note also that the blue and yellow parts are the transpose of each other, since Gaussians have symmetric covariances.
Since this leads to a fully connected set of an arbitrary number of \(\mbf{y}\) nodes, we generalize the graphs above:
where the arrows connecting the \(y\) nodes have been changed to a thick line that represents full connectivity. Finally, because this graph has an arbitrary number of nodes, we can further simplify it to:
which is the graph representing the prior over \(y(\bx)\), since nothing is observed yet. We will take care of that soon. Considering a prior with \(m(\bx)=0\) for simplicity (recall this is equivalent to a prior over \(\bw\) with zero mean), we can get to a general expression for the joint over two groups of function values \(\mbf{y}_a\) and \(\mbf{y}_b\):
$$
p\left(\begin{bmatrix}\mbf{y}_a\\\mbf{y}_b\end{bmatrix}\right) =
\mathcal{N}\left(\begin{bmatrix}\mbf{y}_a\\\mbf{y}_b\end{bmatrix}\,\middle\vert\,\mbf{0},\,
\begin{bmatrix}
K(\mbf{X}_a,\mbf{X}_a) & K(\mbf{X}_a,\mbf{X}_b)\\
K(\mbf{X}_b,\mbf{X}_a) & K(\mbf{X}_b,\mbf{X}_b)
\end{bmatrix}\right)
$$

where \(\mbf{X}\) represents a stacking of \(\bx\) values and \(K\) is the matrix version of \(k(\bx,\bx')\), with one entry per pair of inputs.
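As a minimal numpy sketch of this block structure (our own, using the squared exponential kernel introduced below with parameters chosen only for illustration): the joint covariance is assembled from the four blocks, and marginalizing out \(\mbf{y}_b\) amounts to keeping only the \(K(\mbf{X}_a,\mbf{X}_a)\) block.

```python
import numpy as np

def kernel(xa, xb, ell=0.5):
    # squared exponential kernel for 1D inputs (see below)
    d = xa[:, None] - xb[None, :]
    return np.exp(-d**2 / (2 * ell**2))

xa = np.linspace(0.0, 0.4, 3)    # inputs X_a for the group y_a
xb = np.linspace(0.6, 1.0, 4)    # inputs X_b for the group y_b

# Block covariance of the joint p(y_a, y_b); the off-diagonal blocks
# are each other's transpose, as noted above.
K = np.block([[kernel(xa, xa), kernel(xa, xb)],
              [kernel(xb, xa), kernel(xb, xb)]])

# Marginalizing out y_b simply keeps the K(X_a, X_a) block.
Ka = K[:len(xa), :len(xa)]
assert np.allclose(Ka, kernel(xa, xa))
```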
Kernel engineering#
For basis function models we had \(m(\bx)=\mbf{0}\) and \(k(\bx,\bx')=\frac{1}{\alpha}\basis(\bx)^\T\basis(\bx')\). But why stop there? We can also make our own kernels, as long as the covariance for any set of function values is positive definite. This is called kernel engineering. Discovering new kernels with good properties for specific applications is an active field of research.
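As a quick sanity check of that requirement, here is a small numpy sketch (our own, with an arbitrary polynomial basis) that builds \(k(\bx,\bx')=\frac{1}{\alpha}\basis(\bx)^\T\basis(\bx')\) and verifies that the resulting covariance matrix has no negative eigenvalues:

```python
import numpy as np

def poly_basis(x, M=3):
    # phi_m(x) = x^m for m = 0, ..., M; returns an (N, M+1) design matrix
    return np.stack([x**m for m in range(M + 1)], axis=1)

def basis_kernel(xa, xb, alpha=1.0, M=3):
    # k(x, x') = (1/alpha) * phi(x)^T phi(x')
    return poly_basis(xa, M) @ poly_basis(xb, M).T / alpha

x = np.linspace(-1.0, 1.0, 50)
K = basis_kernel(x, x)

# A valid kernel must yield a positive (semi-)definite covariance:
# all eigenvalues are non-negative up to round-off error.
print(np.linalg.eigvalsh(K).min() >= -1e-10)
```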
A very popular choice of kernel is the so-called squared exponential:

$$
k(\bx,\bx') = \sigma_f^2\exp\left(-\frac{\lVert\bx-\bx'\rVert^2}{2\ell^2}\right)
$$
where \(\sigma_f\) is a scaling factor and \(\ell\) is a length scale that controls the correlation between function values.
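To get a feel for the role of \(\ell\), here is a tiny numpy sketch (values chosen arbitrarily) printing the correlation between the function values at two points one unit apart, for a few length scales:

```python
import numpy as np

def sq_exp(x1, x2, sigma_f=1.0, ell=1.0):
    # squared exponential kernel for scalar inputs
    return sigma_f**2 * np.exp(-(x1 - x2)**2 / (2 * ell**2))

# Correlation between y(0) and y(1): short length scales decorrelate
# the two values quickly, long ones keep them strongly correlated.
for ell in [0.1, 0.5, 1.0, 2.0]:
    print(ell, sq_exp(0.0, 1.0, ell=ell) / sq_exp(0.0, 0.0, ell=ell))
```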
The kernel above is infinitely differentiable, and that is not always a desirable property. Another popular kernel that addresses this issue is the Matérn kernel:

$$
k(\bx,\bx') = \frac{\sigma_f^2}{\Gamma(\nu)2^{\nu-1}}\left(\frac{\sqrt{2\nu}\,\lVert\bx-\bx'\rVert}{\ell}\right)^{\nu} K_\nu\left(\frac{\sqrt{2\nu}\,\lVert\bx-\bx'\rVert}{\ell}\right)
$$
where \(\nu\) controls the degree of differentiability of the functions drawn from the kernel, \(\Gamma(\cdot)\) is the Gamma function and \(K_\nu(\cdot)\) is a modified Bessel function of the second kind.
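A possible numpy/scipy implementation of this kernel is sketched below; `scipy.special.kv` and `scipy.special.gamma` provide \(K_\nu\) and \(\Gamma\), and the special handling of \(d=0\) (where the formula is a \(0\cdot\infty\) limit equal to \(\sigma_f^2\)) is our own addition:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(xa, xb, nu=1.5, sigma_f=1.0, ell=1.0):
    d = np.abs(xa[:, None] - xb[None, :])        # pairwise distances
    arg = np.sqrt(2 * nu) * d / ell
    safe = np.where(d == 0.0, 1.0, arg)          # placeholder to avoid 0 * inf
    K = sigma_f**2 / (gamma(nu) * 2**(nu - 1)) * safe**nu * kv(nu, safe)
    return np.where(d == 0.0, sigma_f**2, K)     # the d -> 0 limit is sigma_f^2

x = np.linspace(0.0, 1.0, 5)
print(matern(x, x, nu=0.5))   # nu = 0.5 recovers the exponential kernel exp(-d / ell)
```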
Click through the tabs below to get a feel for different kernels. On the left you see a graphical representation of the covariance \(\mbf{K}\), with darker tones representing lower values and lighter tones higher values. On the right you see samples of the function \(y(x)\) drawn from Eq. (72) with \(m(x)=0\) and different kernels. Again note that we do not really sample a function from the process: we simply pick 100 values of \(x\) in our range of interest, assemble a covariance \(\mbf{K}\) with entries given by the kernel and sample from the joint \(p(\mbf{y}\vert\mbf{0},\mbf{K})\) (a code sketch of this procedure follows the tabs).
\(k(\bx,\bx') = \displaystyle\frac{1}{\alpha}\basis(\bx)^\T\basis(\bx')\)
\(\phi_m = x^m\)
\(\color{red}m = \left\{0,1\right\}\)
\(k(\bx,\bx') = \displaystyle\frac{1}{\alpha}\basis(\bx)^\T\basis(\bx')\)
\(\phi_m = x^m\)
\(\color{red}m = \left\{0,\dots,3\right\}\)
\(k(\bx,\bx') = \displaystyle\frac{1}{\alpha}\basis(\bx)^\T\basis(\bx')\)
\(\phi_m = x^m\)
\(\color{red}m = \left\{0,\dots,5\right\}\)
\(k(\bx,\bx') = \displaystyle\frac{1}{\alpha}\basis(\bx)^\T\basis(\bx')\)
\(\phi_m=\exp\left[-\displaystyle\frac{(x-\mu_m)^2}{2s_m^2}\right]\)
\(\color{red}M=3\), \(s=0.2\)
\(k(\bx,\bx') = \displaystyle\frac{1}{\alpha}\basis(\bx)^\T\basis(\bx')\)
\(\phi_m=\exp\left[-\displaystyle\frac{(x-\mu_m)^2}{2s_m^2}\right]\)
\(\color{red}M=5\), \(s=0.2\)
\(k(\bx,\bx') = \displaystyle\frac{1}{\alpha}\basis(\bx)^\T\basis(\bx')\)
\(\phi_m=\exp\left[-\displaystyle\frac{(x-\mu_m)^2}{2s_m^2}\right]\)
\(\color{red}M=10\), \(s=0.2\)
\(k(\bx,\bx') = \sigma_f^2\exp\left(\displaystyle-\frac{d^2}{2\ell^2}\right)\)
\(d = \lVert\bx-\bx'\rVert\)
\(\sigma_f=1\), \(\color{red}\ell=1\)
\(k(\bx,\bx') = \sigma_f^2\exp\left(\displaystyle-\frac{d^2}{2\ell^2}\right)\)
\(d = \lVert\bx-\bx'\rVert\)
\(\sigma_f=1\), \(\color{red}\ell=0.5\)
\(k(\bx,\bx') = \sigma_f^2\exp\left(\displaystyle-\frac{d^2}{2\ell^2}\right)\)
\(d = \lVert\bx-\bx'\rVert\)
\(\sigma_f=1\), \(\color{red}\ell=0.1\)
\(\tiny k(\bx,\bx') = \displaystyle\frac{\sigma_f^2}{\Gamma(\nu)2^{\nu-1}} \left( \frac{\sqrt{2\nu}d}{\ell} \right)^\nu K_\nu \left( \frac{\sqrt{2\nu}d}{\ell} \right)\)
\(d = \lVert\bx-\bx'\rVert\)
\(\color{red}\nu \rightarrow \infty\), \(\sigma_f=1\), \(\ell=1\)
\(\tiny k(\bx,\bx') = \displaystyle\frac{\sigma_f^2}{\Gamma(\nu)2^{\nu-1}} \left( \frac{\sqrt{2\nu}d}{\ell} \right)^\nu K_\nu \left( \frac{\sqrt{2\nu}d}{\ell} \right)\)
\(d = \lVert\bx-\bx'\rVert\)
\(\color{red}\nu = 1.5\), \(\sigma_f=1\), \(\ell=1.0\)
\(\tiny k(\bx,\bx') = \displaystyle\frac{\sigma_f^2}{\Gamma(\nu)2^{\nu-1}} \left( \frac{\sqrt{2\nu}d}{\ell} \right)^\nu K_\nu \left( \frac{\sqrt{2\nu}d}{\ell} \right)\)
\(d = \lVert\bx-\bx'\rVert\)
\(\color{red}\nu = 0.5\), \(\sigma_f=1\), \(\ell=1.0\)
\(\small k(\bx,\bx') = 2k_1(\bx,\bx') + k_2(\bx,\bx')\)
\(\small k_1(\bx,\bx') = k_\mathrm{RBF}(M=1,s=0.3)\)
\(\small k_2(\bx,\bx') = k_\mathrm{Poly}(M=3)\)
\(\small k(\bx,\bx') = k_1(\bx,\bx') + k_2(\bx,\bx')\)
\(\small k_1(\bx,\bx') = k_\mathrm{Matern}(\ell=0.1)\)
\(\small k_2(\bx,\bx') = k_\mathrm{Poly}(M=5)\)
\(\small k(\bx,\bx') = k_1\cdot k_2\cdot k_3\)
\(\small k_1(\bx,\bx') = k_\mathrm{SE}(\ell=0.2)\)
\(\small k_2(\bx,\bx') = k_\mathrm{RBF}(M=3,s=0.1)\)
\(\small k_3(\bx,\bx') = k_\mathrm{RBF}(M=2,s=0.1)\)
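Below is a short numpy sketch of the sampling procedure described above the tabs (the squared exponential kernel and the grid are just one possible configuration; the small jitter added to the diagonal is a standard numerical-stability trick, not part of the model):

```python
import numpy as np

def sq_exp(xa, xb, sigma_f=1.0, ell=0.5):
    d = xa[:, None] - xb[None, :]
    return sigma_f**2 * np.exp(-d**2 / (2 * ell**2))

# Pick 100 values of x in the range of interest, assemble K from the
# kernel, and sample y from the joint p(y | 0, K).
x = np.linspace(0.0, 1.0, 100)
K = sq_exp(x, x) + 1e-8 * np.eye(len(x))    # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
print(samples.shape)    # (5, 100): each row is one sampled "function" on the grid
```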
Note how different kernels lead to different covariance structures, which in turn change how the functions drawn from the process behave. It is also interesting to note how adding more radial basis functions makes \(\mbf{K}\) look more and more like the one we get with the squared exponential kernel. This is not a coincidence, and it illustrates the main advantage of the kernel approach: a function \(y(\bx)\) described by a squared exponential kernel is equivalent to a radial basis function model with an infinite number of bases. In weight space this would mean the vector \(\bw\) would have infinitely many entries, making the model intractable. But in function space this model with essentially infinite flexibility becomes usable.
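The convergence hinted at above can be checked numerically. In the sketch below (our own construction, with Gaussian bases of width \(s\) spread on a grid wider than the \(x\) range), the correlation structure of the basis-function kernel approaches that of a squared exponential kernel with length scale \(\sqrt{2}\,s\) as the number of bases \(M\) grows:

```python
import numpy as np

def rbf_basis_kernel(x, centers, s):
    # k(x, x') = phi(x)^T phi(x') with Gaussian bases phi_m centered at mu_m
    Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))
    return Phi @ Phi.T

def sq_exp(x, ell):
    d = x[:, None] - x[None, :]
    return np.exp(-d**2 / (2 * ell**2))

s = 0.2
x = np.linspace(0.0, 1.0, 50)
K_se = sq_exp(x, ell=np.sqrt(2) * s)     # limiting squared exponential (unit variance)

for M in [5, 20, 200]:
    centers = np.linspace(-1.0, 2.0, M)  # bases spread beyond the x range
    K = rbf_basis_kernel(x, centers, s)
    K = K / np.sqrt(np.outer(np.diag(K), np.diag(K)))   # normalize to correlations
    print(M, np.abs(K - K_se).max())     # discrepancy shrinks as M grows
```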
Further Reading
How else can we build valid kernels? Apart from the operations you see for the hybrid kernels above, there are many other ways. You can read pages 295 and 296 of [bishop-prml] to find out, if you are curious.