The Gaussian Distribution#

There is a plethora of different probability distributions to choose from when describing discrete and continuous random variables. When linking random variables together in Graph Models, these distributions become building blocks for more complicated probabilistic models.

In this page we give a very brief overview of the Gaussian distribution, which will appear throughout this part of the course content.

# pip install packages that are not in Pyodide
%pip install ipympl==0.9.3
%pip install seaborn==0.12.2

# Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import axes
from matplotlib.pyplot import Slider
from cycler import cycler
import seaborn as sns
import scipy as sp
from scipy.stats import multivariate_normal, wishart, norm
%matplotlib widget

# Set the color scheme
sns.set_theme()
colors = ['#0076C2', '#EC6842', '#A50034', '#009B77', '#FFB81C', '#E03C31', '#6CC24A', '#EF60A3', '#0C2340', '#00B8C8', '#6F1D77']
plt.rcParams['image.cmap'] = sns.color_palette("rocket_r", as_cmap=True)
plt.rcParams['axes.prop_cycle'] = cycler(color=colors)

The Gaussian is arguably the most widely used probability distribution. It is also one of the main building blocks of a large number of machine learning models and we must therefore give it special attention.

For a scalar variable \(x\), the Gaussian probability density takes the form:

(18)#\[ \mathcal{N}(x\vert\mu,\sigma^2) = \displaystyle\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\displaystyle\frac{1}{2\sigma^2}\left(x-\mu\right)^2\right] \]

where we use the special symbol \(\mathcal{N}\) to denote a Gaussian. We can see that the distribution is completely determined by a mean \(\mu\) and a variance \(\sigma^2\). For a vector \(\mathbf{x}\) of \(D\) variables, this generalizes to the multivariate Gaussian:

(19)#\[ \mathcal{N}(\mathbf{x}\vert\boldsymbol{\mu},\boldsymbol{\Sigma}) = \displaystyle\frac{1}{(2\pi)^{D/2}}\frac{1}{\sqrt{\vert\boldsymbol{\Sigma}\vert}}\exp\left[-\displaystyle\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^\mathrm{T}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right] \]

where the mean is now a vector \(\boldsymbol{\mu}\) of size \(D\) and the variance is replaced by a covariance matrix \(\boldsymbol{\Sigma}\) of size \(D\times D\).
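
As a quick sanity check, here is a minimal sketch (reusing the imports from the top of this page, with arbitrary example values for \(\boldsymbol{\mu}\), \(\boldsymbol{\Sigma}\), and \(\mathbf{x}\)) that evaluates Equation (19) directly and compares it with scipy.stats.multivariate_normal, and that shows how for \(D=1\) the result reduces to Equation (18). Note that scipy.stats.norm is parameterized with the standard deviation \(\sigma\), not the variance \(\sigma^2\).

# Sanity check of Eqs. (18)-(19) with arbitrary example values
mu = np.array([0.5, -1.0])                 # example mean vector
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])             # example covariance matrix
x = np.array([1.0, 0.0])                   # point at which to evaluate the density

# Eq. (19) written out explicitly
D = mu.shape[0]
diff = x - mu
pdf_manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
             / np.sqrt((2 * np.pi)**D * np.linalg.det(Sigma))

# the same value from scipy
pdf_scipy = multivariate_normal(mu, Sigma).pdf(x)
print(pdf_manual, pdf_scipy)               # identical up to floating-point error

# for D = 1 with Sigma = [[sigma^2]], Eq. (19) reduces to Eq. (18)
print(norm(0.5, np.sqrt(2.0)).pdf(1.3), multivariate_normal(0.5, 2.0).pdf(1.3))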

Because its exponent is a quadratic form, the Gaussian has ellipsoidal isoprobability contours in \(D\)-dimensional space whose shape and orientation are determined by \(\boldsymbol{\Sigma}\). The entries of the covariance matrix also encode how strongly the different variables in the distribution are correlated.

In the following we present some important results for the Gaussian that we will use throughout the course.

Conditioning#

Consider a multivariate Gaussian for a vector variable \(\mathbf{x}\). Imagine we split this vector into two groups \(\mathbf{x}_a\) and \(\mathbf{x}_b\) of variables:

(20)#\[\begin{split} \mathbf{x} = \begin{bmatrix}\mathbf{x}_a\\\mathbf{x}_b\end{bmatrix} \quad\Rightarrow\quad p(\mathbf{x}) = \mathcal{N}\left( \begin{bmatrix}\mathbf{x}_a\\\mathbf{x}_b\end{bmatrix} \Bigg\vert \begin{bmatrix}\boldsymbol{\mu}_a\\\boldsymbol{\mu}_b\end{bmatrix} , \begin{bmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab}\\\boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{bmatrix} \right) \end{split}\]

where we have just split our mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\) accordingly.

A nice property of the Gaussian distribution is that the distribution of part of our variables conditioned on the other part is still a Gaussian. More specifically, we can arrive at the expressions:

(21)#\[ p(\mathbf{x}_a\vert\mathbf{x}_b) = \mathcal{N}\left(\mathbf{x}_a\vert\boldsymbol{\mu}_{a\vert b},\boldsymbol{\Sigma}_{a\vert b}\right) \]
(22)#\[ \boldsymbol{\mu}_{a\vert b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\left(\mathbf{x}_b-\boldsymbol{\mu}_b\right) \]
(23)#\[ \boldsymbol{\Sigma}_{a\vert b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba} \]

with a mean that is linear in \(\mathbf{x}_b\) and a covariance that is independent of \(\mathbf{x}_b\).
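
To make these expressions concrete, here is a minimal sketch (with an arbitrary two-dimensional example in which \(x_a\) and \(x_b\) are both scalars): it evaluates Equations (22) and (23) and checks them against the definition of conditioning, i.e. slicing the joint density at the observed \(x_b\) and renormalizing.

# Conditioning a 2D Gaussian, Eqs. (21)-(23), with arbitrary example values
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x_b = 0.3                                    # observed value of the second variable

# partition mu and Sigma (x_a and x_b are scalars here)
mu_a, mu_b = mu[0], mu[1]
S_aa, S_ab, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

# conditional mean and covariance from Eqs. (22)-(23)
mu_cond = mu_a + S_ab / S_bb * (x_b - mu_b)
S_cond = S_aa - S_ab / S_bb * S_ab

# check: slicing the joint at x_b and renormalizing gives the same mean and variance
x_grid = np.linspace(-10.0, 10.0, 20_001)
joint_slice = multivariate_normal(mu, Sigma).pdf(np.column_stack((x_grid, np.full_like(x_grid, x_b))))
w = joint_slice / joint_slice.sum()                # discrete normalization
print(mu_cond, np.sum(w * x_grid))                 # conditional means agree
print(S_cond, np.sum(w * (x_grid - mu_cond)**2))   # conditional variances agree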

Marginalization#

Considering the same partitioning as above, we might also be interested in the marginal distribution of only part of our variables. Starting from the joint \(p(\mathbf{x})\), marginalizing out \(\mathbf{x}_b\) means integrating over it (recall the Sum Rule from Probability Theory):

(24)#\[ p(\mathbf{x}_a) = \displaystyle\int p(\mathbf{x}_a,\mathbf{x}_b)\,\mathrm{d}\mathbf{x}_b \]

Working out the math leads to:

(25)#\[ p(\mathbf{x}_a) = \mathcal{N}\left(\mathbf{x}_a\vert\boldsymbol{\mu}_a,\boldsymbol{\Sigma}_{aa}\right) \]

which is a simple and intuitive result.
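
A minimal Monte Carlo sketch (again with arbitrary example values) illustrates this: sampling from the joint and simply discarding \(x_b\) gives samples whose statistics match \(\mathcal{N}(x_a\vert\mu_a,\Sigma_{aa})\).

# Marginalization, Eq. (25), checked with Monte Carlo (arbitrary example values)
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# draw samples from the joint and simply ignore the x_b component
samples = multivariate_normal(mu, Sigma).rvs(size=100_000, random_state=rng)
x_a_samples = samples[:, 0]

# the empirical mean and variance of x_a approach mu_a = 1.0 and Sigma_aa = 2.0
print(x_a_samples.mean(), x_a_samples.var())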

class plot_multivariate_normal:
    
    def __init__(self, mu: np.ndarray = None, Sigma: np.ndarray = None, x_b: np.ndarray = None,
                 rng: np.random.Generator = None):
        
        self.D = 2
        self.M = 1
        
        if rng is None:
            rng = np.random.default_rng(2)
            
        if mu is None:
            self.mu = norm().rvs(size=self.D, random_state=rng)
        else:
            assert mu.shape[0] == self.D, 'mean should be 1D array of length 2'
            self.mu = mu
        
        if Sigma is None:
            self.Sigma = sp.stats.wishart(self.D, np.eye(self.D), seed=rng).rvs(random_state=rng)
            self.Sigma /= np.linalg.norm(self.Sigma)
        else:
            assert Sigma.shape == (self.D, self.D), 'Sigma should be a 2x2 matrix'
            assert np.all(np.linalg.eigvals(Sigma) > 0.), 'Sigma should be positive definite'
            self.Sigma = Sigma
                          
        if x_b is None:
            self.x_b = norm().rvs(self.D - self.M, random_state=rng)
        else:
            assert x_b.shape[0] == self.M, 'x_b should be 1D array of length 1'
            self.x_b = x_b
            
        self.condition()
        
        
        # make an n x n grid on which to plot the joint pdf
        self.n = n = 100
        self.lims = np.vstack((self.mu - 4., self.mu + 4.)).T
        self.x, self.y = np.linspace(*self.lims[0], self.n), np.linspace(*self.lims[1], self.n)
        self.grid_x, self.grid_y = np.meshgrid(self.x, self.y, indexing='xy')
        self.grid_flat = np.vstack((self.grid_x.flatten(), self.grid_y.flatten()))
        grid_p = np.exp(multivariate_normal(self.mu, self.Sigma).logpdf(self.grid_flat.T))
        self.levels = np.linspace(np.min(grid_p), np.max(grid_p), 15)[1:]
        
        # create figure
        self.fig, self.ax = plt.subplots(figsize=(6,6))
        
        # plot joint pdf and vline for xb
        self.fig.subplots_adjust(bottom=0.35, left=0.25, right=0.8)
        self.plot_cont = self.ax.contourf(self.grid_y.reshape(n,n), self.grid_x.reshape(n,n),
                                          grid_p.reshape(n,n), levels=self.levels)
        self.plot_vline = self.ax.axvline(self.x_b[0])
        self.ax.set_xlim(self.lims[1])
        self.ax.set_ylim(self.lims[0])
        self.ax.set_xlabel(r'$x_b$')
        
        # insert axis and plot p(xa|xb)
        self.ax_cond = self.ax.inset_axes([1.05, 0., 0.2, 1.], sharey=self.ax)
        p_a_given_b = np.exp(multivariate_normal(self.mu_cond,
                                                 self.Sigma_cond).logpdf(self.x))
        self.plot_cond = self.ax_cond.plot(p_a_given_b, self.x, color='C1')
        self.ax_cond.set_xlim([-0.1,1.0])
        self.ax_cond.set_xlabel(r'$p(x_a|x_b)$')
        plt.setp(self.ax_cond.get_yticklabels(), visible=False)
        
        # insert axis and plot p(xa)
        self.ax_marg = self.ax.inset_axes([-.25, 0., 0.2, 1.], sharey=self.ax)
        p_a = np.exp(multivariate_normal(self.mu[0],
                                         self.Sigma[0,0]).logpdf(self.x))
        self.plot_marg = self.ax_marg.plot(p_a, self.x, color='C1')
        self.ax_marg.set_xlabel(r'$p(x_a)$')
        self.ax_marg.set_ylabel(r'$x_a$')
        self.ax_marg.set_xlim([-0.1,1.5])
        self.ax_marg.invert_xaxis()
        plt.setp(self.ax.get_yticklabels(), visible=False)
        
        # create new axes for sliders
        self.ax_xb   = self.fig.add_axes([0.25, 0.20, 0.55, 0.03])
        self.ax_caa  = self.fig.add_axes([0.25, 0.15, 0.55, 0.03])
        self.ax_cbb  = self.fig.add_axes([0.25, 0.10, 0.55, 0.03])
        self.ax_corr = self.fig.add_axes([0.25, 0.05, 0.55, 0.03])
        
        # set sliders
        self.xb_slider = Slider(ax=self.ax_xb,
                                label=r'$x_b$',
                                valmin=self.lims[1,0],
                                valmax=self.lims[1,1],
                                valinit=self.x_b[0],
                                valfmt='%.2f',)
        
        self.caa_slider = Slider(ax=self.ax_caa,
                                 label=r'$\Sigma_{aa}$',
                                 valmin=0.3,
                                 valmax=2.0,
                                 valinit=self.Sigma[0,0],
                                 valfmt='%.2f',
                                 color='C1')
        
        self.cbb_slider = Slider(ax=self.ax_cbb,
                                 label=r'$\Sigma_{bb}$',
                                 valmin=0.3,
                                 valmax=2.0,
                                 valinit=self.Sigma[1,1],
                                 valfmt='%.2f',
                                 color='C1')
        self.corr_slider = Slider(ax=self.ax_corr,
                                  label=r'$\rho_{ab}$',
                                  valmin=-0.95,
                                  valmax=0.95,
                                  valinit=self.Sigma[1,0]/np.sqrt(self.Sigma[0,0]*self.Sigma[1,1]),
                                  valfmt='%.2f',
                                  color='C1')
        
        # register the update function with each slider and show plot
        self.xb_slider.on_changed(self.update_plot)
        self.caa_slider.on_changed(self.update_plot)
        self.cbb_slider.on_changed(self.update_plot)
        self.corr_slider.on_changed(self.update_plot)
        
        plt.show()
        
    def condition(self):
        # function to obtain conditional mean and covariance
    
        # partition mu
        mu_a = self.mu[:self.M]
        mu_b = self.mu[self.M:]
    
        # partition Sigma
        Sigma_aa = self.Sigma[:self.M,:][:,:self.M]
        Sigma_ab = self.Sigma[:self.M,:][:,self.M:]
        Sigma_ba = Sigma_ab.T
        Sigma_bb = self.Sigma[self.M:,:][:,self.M:]
    
        # invert Sigma_bb
        Sigma_bb_inv = np.linalg.inv(Sigma_bb)
    
        # compute conditional mean and covariance
        self.mu_cond    = mu_a + Sigma_ab @ Sigma_bb_inv @ (self.x_b - mu_b)
        self.Sigma_cond = Sigma_aa - Sigma_ab @ Sigma_bb_inv @ Sigma_ba
        
    def update_plot(self, val):
            
        # collect x_b and Sigma from slider
        self.x_b = self.xb_slider.val
        self.Sigma[0,0] = self.caa_slider.val
        self.Sigma[1,1] = self.cbb_slider.val
        self.Sigma[1,0] = np.sqrt(self.Sigma[0,0]) * np.sqrt(self.Sigma[1,1]) * self.corr_slider.val
        self.Sigma[0,1] = self.Sigma[1,0]
        
        grid_p = np.exp(multivariate_normal(self.mu, self.Sigma).logpdf(self.grid_flat.T))
        levels = np.linspace(np.min(grid_p), np.max(grid_p), 15)[1:]
            
        n = self.n
        
        # remove and replot contours of joint
        for coll in self.plot_cont.collections:
            coll.remove()
        self.plot_cont = self.ax.contourf(self.grid_y.reshape(n,n), self.grid_x.reshape(n,n),
                                          grid_p.reshape(n,n), levels=levels)
        
        # redo the conditioning
        self.condition()
        
        # remove and replot vline
        self.plot_vline.remove()
        self.plot_vline = self.ax.axvline(self.x_b)
        
        # set new conditional pdf
        p_a_given_b = np.exp(multivariate_normal(self.mu_cond,
                                                 self.Sigma_cond).logpdf(self.x))
        self.plot_cond[0].set_xdata(p_a_given_b)
        
        # set new marginal pdf
        p_a = np.exp(multivariate_normal(self.mu[0],
                                         self.Sigma[0,0]).logpdf(self.x))
        self.plot_marg[0].set_xdata(p_a)
        
        # draw figure
        self.fig.canvas.draw_idle()
    

You can play around with the plot below by dragging the slider for \(x_b\) to get a feel for the effect of marginalization and conditioning on normally distributed random variables. You can also alter the covariance matrix by dragging the sliders for \(\Sigma_{aa}\), \(\Sigma_{bb}\), and \(\rho\). The parameter \(\rho\) is the correlation between \(x_a\) and \(x_b\), and is given by

(26)#\[ \rho_{x_a, x_b} = \operatorname{corr}(x_a, x_b) = \frac{\operatorname{cov}(x_a, x_b)}{\sigma_a \sigma_b} = \frac{\Sigma_{ab}}{\sqrt{\Sigma_{aa}\Sigma_{bb}}} \]

and can be interpreted as a normalized version of the covariance \(\Sigma_{ab}\). We chose this parameterization because keeping \(\vert\rho\vert < 1\) (together with positive variances) guarantees a valid, positive definite covariance matrix \(\boldsymbol{\Sigma}\). Click on the top right of the page and then on Live Code to enable the interactive content.

plot = plot_multivariate_normal()
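
For reference, this is how a covariance matrix is assembled from the three covariance sliders above; with \(\vert\rho\vert<1\) and positive variances its eigenvalues stay positive (the numbers below are arbitrary example values):

# Building Sigma from variances and a correlation coefficient (arbitrary values)
S_aa, S_bb, rho = 1.2, 0.5, -0.7
S_ab = rho * np.sqrt(S_aa * S_bb)
Sigma = np.array([[S_aa, S_ab],
                  [S_ab, S_bb]])
print(np.linalg.eigvals(Sigma))   # both eigenvalues positive, so Sigma is a valid covariance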

Bayesian inversion#

The last important result has to do with Bayes’ Theorem. We showed above how to obtain conditionals and marginals of a partitioned Gaussian. Now imagine we are given the following graph model:

Fig. 9 A two-node graph with linear-Gaussian relationships#

From the graph above we have the marginal \(p(\mathbf{x})=\mathcal{N}(\mathbf{x}\vert\boldsymbol{\mu},\boldsymbol{\Sigma})\) and the conditional \(p(\mathbf{y}\vert\mathbf{x})=\mathcal{N}(\mathbf{y}\vert\mathbf{A}\mathbf{x}+\mathbf{b},\mathbf{L}^{-1})\), where the mean of \(\mathbf{y}\) is a linear function of \(\mathbf{x}\) defined by the matrix \(\mathbf{A}\) and the vector \(\mathbf{b}\), and \(\mathbf{L}\) is a precision matrix. But now that we have observed \(\mathbf{y}\), we can see \(p(\mathbf{x})\) as a prior and \(p(\mathbf{y}\vert\mathbf{x})\) as a likelihood. This means we can use Bayes’ Theorem to get a posterior over \(\mathbf{x}\):

(27)#\[ p(\mathbf{x}\vert\mathbf{y}) = \displaystyle\frac{p(\mathbf{y}\vert\mathbf{x})p(\mathbf{x})}{p(\mathbf{y})} \]

From Fig. 9, our joint distribution is \(p(\mathbf{x},\mathbf{y})= p(\mathbf{x})p(\mathbf{y}\vert\mathbf{x})\). Since this is again a Gaussian, the distributions we are looking for are also Gaussian. This means we can use the conditioning and marginalization expressions from before to first compute the marginal over \(\mathbf{y}\):

(28)#\[ p(\mathbf{y}) = \mathcal{N}\left( \mathbf{y}\vert\mathbf{A}\boldsymbol{\mu}+\mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\mathrm{T} \right) \]

and finally our posterior:

(29)#\[ p(\mathbf{x}\vert\mathbf{y}) = \mathcal{N}\left( \mathbf{x}\vert\boldsymbol{\Sigma}^*\left[\mathbf{A}^\mathrm{T}\mathbf{L}\left(\mathbf{y}-\mathbf{b}\right)+\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right],\boldsymbol{\Sigma}^* \right) \]
(30)#\[ \boldsymbol{\Sigma}^* = \left(\boldsymbol{\Sigma}^{-1}+\mathbf{A}^\mathrm{T}\mathbf{L}\mathbf{A}\right)^{-1} \]
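
As a numerical check (not a derivation), the following sketch works in one dimension, where \(\mathbf{A}\), \(\mathbf{b}\), and \(\mathbf{L}\) reduce to scalars with arbitrary example values: it evaluates Equations (29) and (30) in closed form and compares the result with a brute-force posterior obtained by multiplying prior and likelihood on a grid and normalizing.

# Checking Eqs. (29)-(30) in 1D with arbitrary example values
mu, Sigma = 0.5, 2.0          # prior p(x) = N(x | mu, Sigma)
A, b, L = 1.5, -1.0, 4.0      # likelihood p(y | x) = N(y | A x + b, 1 / L)
y_obs = 2.0                   # observed value of y

# closed-form posterior from Eqs. (29)-(30)
Sigma_star = 1.0 / (1.0 / Sigma + A * L * A)
mu_star = Sigma_star * (A * L * (y_obs - b) + mu / Sigma)

# brute-force posterior: prior times likelihood on a grid, then normalize
x_grid = np.linspace(-10.0, 10.0, 20_001)
w = norm(mu, np.sqrt(Sigma)).pdf(x_grid) * norm(A * x_grid + b, np.sqrt(1.0 / L)).pdf(y_obs)
w /= w.sum()                                          # discrete normalization

print(mu_star, np.sum(w * x_grid))                    # posterior means agree
print(Sigma_star, np.sum(w * (x_grid - mu_star)**2))  # posterior variances agree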

Further Reading

You can find a very insightful extended discussion of the Gaussian in Section 2.3 of Bishop's Pattern Recognition and Machine Learning [bishop-prml]. The most important relationships are already listed above, but you are encouraged to derive them yourself while reading the complete discussion in Section 2.3.

Fitting distributions to data#

The distributions above have parameters that can be tweaked, and this is of course how we can use distributions with the same functional forms to explain a very broad range of phenomena. For instance, a coin toss could be represented by a Bernoulli distribution with \(\mu=0.5\), while the same distribution appeared in our discussion of Bayes’ Theorem with \(\mu=0.6\), encoding our prior belief about how much a person studied.

Here we briefly show the two main ways of fitting these parameters to data.

The frequentist way#

In a frequentist approach we ask ourselves the question:

If I fix \(\mu\) and draw a set of samples from my model at random, how likely would it be that those samples are exactly equal to the data I am observing?

For example, given \(N\) observations \(x_n\) of \(x\), we can imagine these samples are drawn independently and identically distributed (i.i.d.) from \(p(x\vert\mu)\), with \(\sigma^2\) assumed known, and compute the likelihood function:

(31)#\[ p(\mathcal{D}\vert\mu) = \displaystyle\prod^N_{n=1}\displaystyle\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\displaystyle\frac{1}{2\sigma^2}\left(x_n-\mu\right)^2\right] \]

The higher this function, the more plausible it is that our current distribution (with the given \(\mu\)) generated the observed dataset. To obtain a distribution that fits our data well, we are therefore incentivised to pick the value of \(\mu\) that maximizes this likelihood. We refer to this as a Maximum Likelihood Estimation (MLE) approach. Taking the logarithm of the expression above and setting its derivative with respect to \(\mu\) to zero, we get:

(32)#\[ \mu_{\text{MLE}} = \displaystyle\frac{1}{N}\displaystyle\sum^N_{n=1}x_n \]

which gives the usual result that the mean of the Gaussian should be set to the sample mean of the observed dataset.
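
A minimal sketch of this result, with an assumed true mean and standard deviation used only to generate synthetic data: the sample mean from Equation (32) coincides with the \(\mu\) that maximizes the log-likelihood numerically.

# MLE of the mean from synthetic data (assumed true parameters below)
rng = np.random.default_rng(1)
mu_true, sigma = 1.5, 0.8
x_data = norm(mu_true, sigma).rvs(size=200, random_state=rng)

# Eq. (32): the MLE is simply the sample mean
mu_mle = x_data.mean()

# the same answer from maximizing the log-likelihood over a grid of candidate mu values
mu_grid = np.linspace(0.0, 3.0, 2_001)
log_lik = np.array([norm(m, sigma).logpdf(x_data).sum() for m in mu_grid])
print(mu_mle, mu_grid[np.argmax(log_lik)])   # both close to mu_true = 1.5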

The Bayesian way#

In a Bayesian approach we do not deal with drawing datasets and looking at frequencies, but instead with prior and posterior beliefs over our parameters. We ask ourselves the question:

My prior experience says \(\mu\) should be within a certain range. After observing my dataset, how much does my belief change?

This means we first adopt a prior \(p(\mu)\) and after observing a dataset \(\mathcal{D}\) we compute the posterior:

(33)#\[ p(\mu\vert\mathcal{D}) = \displaystyle\frac{p(\mathcal{D}\vert\mu)p(\mu)}{p(\mathcal{D})} \]

which shifts our belief about \(\mu\) to a new range. Note that here we do not end up with a single value for \(\mu\) but with a full distribution. As we observe more and more data, the variance of \(p(\mu\vert\mathcal{D})\) gradually decreases, until the observations provide such overwhelming evidence that the posterior becomes highly peaked around the true value of \(\mu\).
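
To illustrate this shrinking posterior, here is a minimal sketch assuming a Gaussian prior \(p(\mu)=\mathcal{N}(\mu\vert\mu_0,\sigma_0^2)\) and a known \(\sigma^2\); the closed-form update used below is the standard conjugate-prior result (it is not derived on this page, but follows from manipulations similar to those above).

# Bayesian updating of mu with a Gaussian prior and known sigma^2 (assumed values)
rng = np.random.default_rng(3)
mu_true, sigma = 1.5, 0.8          # data-generating parameters (only used for sampling)
mu_0, sigma_0 = 0.0, 1.0           # prior p(mu) = N(mu | mu_0, sigma_0^2)

for N in [1, 10, 100, 1000]:
    x_data = norm(mu_true, sigma).rvs(size=N, random_state=rng)

    # standard conjugate update for the posterior p(mu | D)
    var_N = 1.0 / (1.0 / sigma_0**2 + N / sigma**2)
    mu_N = var_N * (mu_0 / sigma_0**2 + x_data.sum() / sigma**2)

    # the posterior standard deviation shrinks as N grows
    print(f'N = {N:4d}: posterior mean = {mu_N:.3f}, posterior std = {np.sqrt(var_N):.3f}')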