# Active Learning

The Bayesian approach of weighing prior beliefs and observations lends itself well to situations in which a complete dataset is not available from the start but data is instead coming in gradually in a sequential manner:

:::{card} **Batch learning**

A dataset $\mathcal{D}$ with several observations is already available before training. Bayes' Theorem is used to go directly from the no-data initial prior to the final posterior distribution.
:::

:::{card} **Active learning**

We start with an empty dataset and only the initial prior distribution. A posterior is computed when new data comes in and this posterior becomes the prior for the next update. This is repeated until all the data has been observed.
:::

Here we demonstrate this approach with a couple of examples. As {doc}`before<linear_models>` we start from a prior over our parameters:

$$
p(\mathbf{w}) = \mathcal{N}\left(\mathbf{w}\vert\boldsymbol{0},\alpha^{-1}\mathbf{I}\right)
$$(prior2)

Using the same conditioning approach as before, we get for the first data point:

$$
p(\mathbf{w}\vert\mathbf{t}) = \mathcal{N}\left(\mathbf{w}\vert\mathbf{m}_1,\mathbf{S}_1\right)
$$(posterior2)

$$
\mathbf{m}_1 = \beta\mathbf{S}_1\boldsymbol{\phi}(\mathbf{x}_1)^\mathrm{T}t_1
$$(postmean2)

$$
\mathbf{S}_1^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\phi}(\mathbf{x}_1)^\mathrm{T}\boldsymbol{\phi}(\mathbf{x}_1)
$$(postvar2)

Note that we now only compute the basis functions for a single input vector $\mathbf{x}_1$ and condition on only a single target $t_1$ (it was a vector in Eq. {eq}`postmean`).

To observe a second data point we just use Eq. {eq}`posterior2` as **our new prior** and repeat the process. Using the {ref}`standard expressions<bayes-stdexpressions>` from before we get:

$$
p(\mathbf{w}\vert\mathbf{t}) = \mathcal{N}\left(\mathbf{w}\vert\mathbf{m}_2,\mathbf{S}_2\right)
$$(posterior3)

$$
\mathbf{m}_2 = \mathbf{S}_2\left(\mathbf{S}_1^{-1}\mathbf{m}_1 + \beta\boldsymbol{\phi}(\mathbf{x}_2)^\mathrm{T}t_2\right)
$$(postmean3)

$$
\mathbf{S}_2^{-1} = \mathbf{S}_1^{-1} + \beta\boldsymbol{\phi}(\mathbf{x}_2)^\mathrm{T}\boldsymbol{\phi}(\mathbf{x}_2)
$$(postvar3)

and recalling that $\mathbf{m}_0=\boldsymbol{0}$ in Eq. {eq}`prior2`, we see that we have exactly the same expressions for the second update but with the first posterior acting as the new prior.

The above can be generalized as:

$$
p(\mathbf{w}\vert\mathbf{t}) = \mathcal{N}\left(\mathbf{w}\vert\mathbf{m}_\mathrm{new},\mathbf{S}_\mathrm{new}\right)
$$(posteriorN)

$$
\mathbf{m}_\mathrm{new} = \mathbf{S}_\mathrm{new}\left(\mathbf{S}_\mathrm{old}^{-1}\mathbf{m}_\mathrm{old} + \beta\boldsymbol{\phi}(\mathbf{x}_\mathrm{new})^\mathrm{T}t_\mathrm{new}\right)
$$(postmeanN)

$$
\mathbf{S}_\mathrm{new}^{-1} = \mathbf{S}_\mathrm{old}^{-1} + \beta\boldsymbol{\phi}(\mathbf{x}_\mathrm{new})^\mathrm{T}\boldsymbol{\phi}(\mathbf{x}_\mathrm{new})
$$(postvarN)

Observing one point at a time is not strictly necessary: we could also observe data in chunks, and the same expressions would hold as long as we arrange $\mathbf{t}$ and $\boldsymbol{\Phi}$ in their proper vector/matrix forms.
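
The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the figures on this page: the helper name `sequential_update`, the polynomial features and the synthetic data are all assumptions made here. The sketch also checks that assimilating points one at a time leads to the same posterior as a single batch update.

```python
import numpy as np

def sequential_update(m_old, L_old, Phi_new, t_new, beta):
    """One Bayesian update; L denotes a precision matrix (inverse covariance).

    Phi_new holds one row of basis-function values per new observation,
    so a single point is a (1, M) array and a chunk is an (n, M) array.
    """
    Phi_new = np.atleast_2d(Phi_new)
    t_new = np.atleast_1d(t_new)
    L_new = L_old + beta * Phi_new.T @ Phi_new
    m_new = np.linalg.solve(L_new, L_old @ m_old + beta * Phi_new.T @ t_new)
    return m_new, L_new

# Hypothetical setup: three polynomial features and six noisy targets
rng = np.random.default_rng(0)
alpha, beta = 5.0, 100.0
x = rng.uniform(0.0, 1.0, 6)
Phi = np.vander(x, 3, increasing=True)      # rows are [1, x, x^2]
t = np.sin(x) + rng.normal(0.0, np.sqrt(1.0 / beta), 6)

m0, L0 = np.zeros(3), alpha * np.eye(3)

# One point at a time: each posterior becomes the next prior
m_seq, L_seq = m0, L0
for phi_n, t_n in zip(Phi, t):
    m_seq, L_seq = sequential_update(m_seq, L_seq, phi_n, t_n, beta)

# All points at once
m_batch, L_batch = sequential_update(m0, L0, Phi, t, beta)

print(np.allclose(m_seq, m_batch), np.allclose(L_seq, L_batch))
```

Both routes should agree up to round-off, which is why the order and chunking of the observations do not matter for this model.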

**Click through the tabs below** to see an example of this procedure. We start with a prior model with Radial Basis Functions and observe one data point at a time. On the left plots you can see 10 sets of weights sampled from our prior/posterior distribution and the corresponding predictions they give. This is a nice feature of the Bayesian approach: we do not end up with a single trained model but with a bag of models we can draw from.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from myst_nb import glue
from matplotlib.lines import Line2D
from cycler import cycler
import seaborn as sns

# Set the color scheme
sns.set_theme()
colors = ['#0076C2', '#EC6842', '#A50034', '#009B77', '#FFB81C', '#E03C31', '#6CC24A', '#EF60A3', '#0C2340', '#00B8C8', '#6F1D77']
plt.rcParams['axes.prop_cycle'] = cycler(color=colors)

M = 9
s = 2.*np.pi/float(M)
alpha = 5.0
beta = 100.0

def f (x):
    return np.sin(x)

def bayes(m_prior,L_prior,xval,x=None,y=None):
    centers = np.linspace(0.,2.*np.pi,M)

    if x is not None:
        n = len(x)

        gram = np.zeros((n,M+1))
        gram[:,0] = 1.0

        for i in range(n):
            for j in range(M):
                gram[i,j+1] = np.exp(-(x[i]-centers[j])**2.0/2.0/s/s)

        L_post =  L_prior + beta * gram.transpose() @ gram

        S_post = np.linalg.inv(L_post)

        m_post = S_post @ (L_prior @ m_prior + beta * gram.transpose() @ y)
    else:
        m_post = m_prior
        L_post = L_prior
        S_post = np.linalg.inv(L_prior)
                  
    yval = np.zeros(len(xval))
    yvar = np.zeros(len(xval))
    
    for i in range(len(xval)):
        features = np.zeros(M+1)
        features[0] = 1.0
        for j in range(M):
            features[j+1] = np.exp(-(xval[i]-centers[j])**2.0/2.0/s/s)
        yval[i] = np.inner(features,m_post)
        yvar[i] = 1./beta + np.inner(features,np.matmul(S_post,features))

    return m_post, L_post, yval, yvar

def sample(m,L,xval):
    centers = np.linspace(0.,2.*np.pi,M)
    
    S = np.linalg.inv(L)
    
    w = np.random.multivariate_normal(m,S)
   
    yval = np.zeros(len(xval))
    yvar = np.zeros(len(xval))

    for i in range(len(xval)):
        features = np.zeros(M+1)
        features[0] = 1.0
        for j in range(M):
            features[j+1] = np.exp(-(xval[i]-centers[j])**2.0/2.0/s/s)
        yval[i] = np.inner(features,w)

    return yval

N = 5
n_samples = 10

np.random.seed(101230)

xval = np.linspace(0,2.*np.pi,1000)

prior_mean = np.zeros(M+1)
prior_prec = alpha*np.eye(M+1)

_,_,y_map, y_var = bayes(prior_mean,prior_prec,xval)

fig, (ax1,ax2) = plt.subplots(1,2,figsize=(8,3),dpi=400)
ax1.plot(xval,f(xval),'k--',label='Ground truth',linewidth=1)
ax1.set_xlabel('x')
ax1.set_ylabel('t')

handles, labels = ax1.get_legend_handles_labels()
line = Line2D([0], [0], color='gray', lw=1)
handles.append(line)
labels.append('y(x) samples')
ax1.legend(handles=handles, labels=labels, fontsize=7, loc='upper right')
ax1.set_ylim([-1.7, 1.7])

for i in range(n_samples):
    y_sample = sample(prior_mean,prior_prec,xval)
    ax1.plot(xval,y_sample,linewidth=1)

ax2.plot(xval,f(xval),'k--',label='Ground truth',linewidth=1)
ax2.plot(xval,y_map,label='Predictive mean',color='C1')
ax2.set_xlabel('x')
ax2.fill_between(xval,y_map-1.96*np.sqrt(y_var),y_map+1.96*np.sqrt(y_var),alpha=0.3,
                 label='95% conf. interval',color='C1')
ax2.legend(fontsize=7, loc='upper right')
ax2.set_ylim([-1.7, 1.7])

glue("fig0", fig, display=False)

x = np.random.uniform(0,2*np.pi,N)
y = f(x) + np.random.normal(0.,np.sqrt(1./beta),N)

for i in range(len(x)):
    prior_mean, prior_prec, y_map, y_var = bayes(prior_mean,prior_prec,xval,[x[i]],[y[i]])
    
    fig, (ax1, ax2) = plt.subplots(1,2,figsize=(8,3),dpi=400)
    
    ax1.plot(xval,f(xval),'k--',label='Ground truth',linewidth=1)
    ax1.set_xlabel('x')
    ax1.set_ylabel('t')
    
    for j in range(n_samples):
        y_sample = sample(prior_mean,prior_prec,xval)
        ax1.plot(xval,y_sample,linewidth=1)

    ax1.plot(x[:i+1],y[:i+1],'k.',markersize=10,label='Observations')
    
    handles, labels = ax1.get_legend_handles_labels()
    line = Line2D([0], [0], color='gray', lw=1)
    handles.append(line)
    labels.append('y(x) samples')
    ax1.legend(handles=handles, labels=labels, fontsize=7, loc='upper right')
    ax1.set_ylim([-1.7, 1.7])
    
    ax2.plot(xval,f(xval),'k--',label='Ground truth',linewidth=1)
    ax2.plot(xval,y_map,label='Predictive mean',color='C1')
    ax2.plot(x[:i+1],y[:i+1],'k.',markersize=10,label='Observations')
    ax2.set_xlabel('x')
    ax2.fill_between(xval,y_map-1.96*np.sqrt(y_var),y_map+1.96*np.sqrt(y_var),alpha=0.3,
                     label='95% conf. interval', color='C1')
    ax2.legend(fontsize=7, loc='upper right')
    ax2.set_ylim([-1.7, 1.7])
    
    glue("fig"+str(i+1), fig, display=False)



`````{tab-set}
````{tab-item} Prior

```{glue:figure} fig0
:figwidth: 750px

Model behavior under only our prior assumptions
```

````

````{tab-item} 1 data point

```{glue:figure} fig1
:figwidth: 750px

Bayesian fit with one observation
```

````

````{tab-item} 2 data points

```{glue:figure} fig2
:figwidth: 750px

Bayesian fit with two observations
```

````

````{tab-item} 3 data points

```{glue:figure} fig3
:figwidth: 750px

Bayesian fit with three observations
```

````

````{tab-item} 4 data points

```{glue:figure} fig4
:figwidth: 750px

Bayesian fit with four observations
```

````

````{tab-item} 5 data points

```{glue:figure} fig5
:figwidth: 750px

Bayesian fit with five observations
```

````

`````

What can you observe from the results above? Note how models sampled from the initial prior are quite uninformed. As soon as some data is observed, the **posterior** space of possible models becomes more and more constrained to agree with the observed points. Note also that instead of drawing models from our posterior we can conveniently just look at the predictive mean and variance on the right-hand plots. This already conveys enough information since all our distributions are Gaussian.
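
The predictive mean and variance shown on the right-hand plots follow directly from the posterior over the weights. Below is a minimal vectorized sketch, assuming a feature matrix `Phi_val` with one row per evaluation point. The helper name `predictive` and the toy numbers are illustrative and are not taken from the figures.

```python
import numpy as np

def predictive(Phi_val, m, S, beta):
    """Predictive mean and variance at each row of Phi_val."""
    mean = Phi_val @ m
    # Noise variance plus phi^T S phi for every row, without an explicit loop
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_val, S, Phi_val)
    return mean, var

# Toy posterior over two weights (illustrative values)
m = np.array([0.5, -1.0])
S = np.array([[0.2, 0.05], [0.05, 0.1]])
Phi_val = np.array([[1.0, 0.0], [1.0, 2.0]])

mean, var = predictive(Phi_val, m, S, beta=100.0)

# Band plotted around the predictive mean
lower = mean - 1.96 * np.sqrt(var)
upper = mean + 1.96 * np.sqrt(var)
```

The variance never drops below $1/\beta$: even a perfectly known set of weights leaves the observation noise in place.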

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler
from myst_nb import glue
from matplotlib.lines import Line2D
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns

# Set the color scheme
sns.set_theme()
colors = ['#0076C2', '#EC6842', '#A50034', '#009B77', '#FFB81C', '#E03C31', '#6CC24A', '#EF60A3', '#0C2340', '#00B8C8', '#6F1D77']
plt.rcParams['axes.prop_cycle'] = cycler(color=colors)

alpha = 100.0
beta = 40.0

def f (x):
    return x

def bayes(m_prior,L_prior,xval,x=None,y=None):
    if x is not None:
        n = len(x)

        gram = np.zeros((n,1))
        
        gram[:,0] = x

        L_post =  L_prior + beta * gram.transpose() @ gram

        S_post = np.linalg.inv(L_post)

        m_post = S_post @ (L_prior @ m_prior + beta * gram.transpose() @ y)
    else:
        m_post = m_prior
        L_post = L_prior
        S_post = np.linalg.inv(L_prior)
                  
    yval = np.zeros(len(xval))
    yvar = np.zeros(len(xval))
    
    for i in range(len(xval)):
        features = np.zeros(1)
        features[0] = xval[i]
        yval[i] = np.inner(features,m_post)
        yvar[i] = 1./beta + np.inner(features,np.matmul(S_post,features))

    return m_post, L_post, yval, yvar

def sample(m,L,xval):   
    S = np.linalg.inv(L)
    
    w = np.random.multivariate_normal(m,S)
   
    yval = np.zeros(len(xval))
    yvar = np.zeros(len(xval))

    for i in range(len(xval)):
        features = np.zeros(1)
        features[0] = xval[i]

        yval[i] = np.inner(features,w)

    return yval

def gauss(m,L):
    var = np.linalg.inv(L)[0,0]
    mean = m[0]
    
    x = np.linspace(-0.5,1.1,1000)
    
    p = 1. / np.sqrt(2.*np.pi*var) * np.exp(-1./2./var * (x - mean)**2.)
    
    return x,p

def likelihood(x,t):
    # beta is a precision, so the noise variance of p(t|w) is 1/beta
    var = 1./beta
    # mean = w*x
    
    w = np.linspace(-10,10,1000)
    
    p = 1. / np.sqrt(2.*np.pi*var) * np.exp(-1./2./var * (t - w*x)**2.)
    
    return w,p
   
N = 5
n_samples = 10

np.random.seed(34510)


xval = np.linspace(0,2.*np.pi,1000)

prior_mean = np.zeros(1)
prior_prec = alpha*np.eye(1)

_,_,y_map, y_var = bayes(prior_mean,prior_prec,xval)

fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2,figsize=(10,7),dpi=400)
ax1.plot(xval,f(xval),'k--',label='Ground truth',linewidth=1)
ax1.set_xlabel('x')
ax1.set_ylabel('t')

handles, labels = ax1.get_legend_handles_labels()
line = Line2D([0], [0], color='gray', lw=1)
handles.append(line)
labels.append('Weight samples')
ax1.legend(handles=handles, labels=labels, fontsize=7, loc='upper left')
ax1.set_ylim([-1,6.9])

for i in range(n_samples):
    y_sample = sample(prior_mean,prior_prec,xval)
    ax1.plot(xval,y_sample,linewidth=1)

ax2.plot(xval,f(xval),'k--',label='Ground truth',linewidth=1)
ax2.plot(xval,y_map,label='Predictive mean',color='C1')
ax2.set_xlabel('x')
ax2.set_ylabel('t')
ax2.fill_between(xval,y_map-1.96*np.sqrt(y_var),y_map+1.96*np.sqrt(y_var),alpha=0.3,
                 label='95% conf. interval',color='C1')
ax2.legend(fontsize=7, loc='upper left')
ax2.set_ylim([-1,6.9])

x_gauss, y_gauss = gauss(prior_mean,prior_prec)
ax3.plot(x_gauss,y_gauss)
ax3.set_xlabel('w')
ax3.set_ylabel('prior p(w)')
ax3.set_ylim([-.5,17.5])

ax4.set_xlabel('w')
ax4.set_ylabel('p(t|w)')
ax4.text(0.5,0.5,'No data',fontsize=14,horizontalalignment='center',verticalalignment='center')

glue("fig6", fig, display=False)

x = np.random.uniform(0,2*np.pi,N)
y = f(x) + np.random.normal(0.,np.sqrt(1./beta),N)

for i in range(len(x)):
    w_gauss_prior, p_gauss_prior = gauss(prior_mean,prior_prec)
    
    prior_mean, prior_prec, y_map, y_var = bayes(prior_mean,prior_prec,xval,[x[i]],[y[i]])
    
    fig, ((ax1, ax2),(ax3,ax4)) = plt.subplots(2,2,figsize=(10,7),dpi=400)
    
    ax1.plot(xval,f(xval),'k--',label='Ground truth',linewidth=1)
    ax1.set_xlabel('x')
    ax1.set_ylabel('t')
    
    for j in range(n_samples):
        y_sample = sample(prior_mean,prior_prec,xval)
        ax1.plot(xval,y_sample,linewidth=1)

    ax1.plot(x[:i+1],y[:i+1],'k.',markersize=10,label='Observations')
    
    handles, labels = ax1.get_legend_handles_labels()
    line = Line2D([0], [0], color='gray', lw=1)
    handles.append(line)
    labels.append('Weight samples')
    ax1.legend(handles=handles, labels=labels, fontsize=7, loc='upper left')
    ax1.set_ylim([-1,6.9])
    
    ax2.plot(xval,f(xval),'k--',label='Ground truth',linewidth=1)
    ax2.plot(xval,y_map,label='Predictive mean',color='C1')
    ax2.plot(x[:i+1],y[:i+1],'k.',markersize=10,label='Observations')
    ax2.set_xlabel('x')
    ax2.set_ylabel('t')
    ax2.fill_between(xval,y_map-1.96*np.sqrt(y_var),y_map+1.96*np.sqrt(y_var),alpha=0.3,
                     label='95% conf. interval',color='C1')
    ax2.legend(fontsize=7, loc='upper left')
    ax2.set_ylim([-1,6.9])
    
    w_gauss_post, p_gauss_post = gauss(prior_mean,prior_prec)
    ax3.plot(w_gauss_post,p_gauss_post,label='Posterior')
    ax3.plot(w_gauss_prior,p_gauss_prior,label='Prior',alpha=0.5)
    ax3.set_xlabel('w')
    ax3.set_ylabel('prior/posterior p(w)')
    ax3.legend(loc='upper left')
    ax3.set_ylim([-.5,17.5])
    
    w, l = likelihood(x[i],y[i])
    ax4.plot(w,l,label='Likelihood')
    ax4.set_xlabel('w')
    ax4.set_ylabel('p(t|w)')
    ax4.yaxis.set_major_formatter(FormatStrFormatter('%.2f'))
    ax4.legend(loc='upper left')
    ax4.set_ylim(bottom=-.002)  # top limit is left to autoscaling so the peak is never clipped
    
    glue("fig"+str(i+7), fig, display=False)

## A deeper look

The figures above show what happens to our final model as we observe data. However, they give only an indirect idea of how $p(\mathbf{w}\vert\mathbf{t})$ changes as more data is added. Since the functions above are of the form $y = \mathbf{w}^\mathrm{T}\boldsymbol{\phi}(\mathbf{x})$ with 10 weights, it is difficult to visualize their joint probability distribution in 10 dimensions.

To make that visible, the figures below show a simple linear model with **a single weight** (the intercept is fixed at zero):

$$
p(t\vert w,\beta) = \mathcal{N}(t\vert wx,\beta^{-1}) \quad\quad p(w\vert\alpha) = \mathcal{N}(w\vert 0,\alpha^{-1})
$$(1dbasisfuncmodel)

which in this case means $\boldsymbol{\phi}(\mathbf{x})=[x]$. Again we start with no observations and fix $\alpha=100$ and $\beta=40$. Click through the tabs below to see how training evolves as more data becomes available:
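
With a single weight, every quantity in the update is a scalar, so the recursion can be written out by hand. The sketch below uses the values of $\alpha$ and $\beta$ stated above. The helper name `update_scalar` and the three observations are made up for illustration and are not the points shown in the figures.

```python
alpha, beta = 100.0, 40.0

def update_scalar(m_old, prec_old, x_new, t_new, beta):
    """Posterior mean and precision of w after one observation (x_new, t_new)."""
    prec_new = prec_old + beta * x_new**2
    m_new = (prec_old * m_old + beta * x_new * t_new) / prec_new
    return m_new, prec_new

m, prec = 0.0, alpha                 # prior N(w | 0, 1/alpha)
for x_n, t_n in [(0.5, 0.6), (3.0, 2.9), (5.0, 5.1)]:   # hypothetical data
    m, prec = update_scalar(m, prec, x_n, t_n, beta)
    print(f"x={x_n:.1f}  mean={m:.3f}  std={prec**-0.5:.3f}")
```

Each observation adds $\beta x^2$ to the precision, so points far from the origin sharpen the posterior much more than points close to it.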

`````{tab-set}
````{tab-item} Prior

```{glue:figure} fig6
:figwidth: 750px

Model behavior under only our prior assumptions
```

````

````{tab-item} 1 data point

```{glue:figure} fig7
:figwidth: 750px

Bayesian fit with one observation
```

````

````{tab-item} 2 data points

```{glue:figure} fig8
:figwidth: 750px

Bayesian fit with two observations
```

````

````{tab-item} 3 data points

```{glue:figure} fig9
:figwidth: 750px

Bayesian fit with three observations
```

````

````{tab-item} 4 data points

```{glue:figure} fig10
:figwidth: 750px

Bayesian fit with four observations
```

````

````{tab-item} 5 data points

```{glue:figure} fig11
:figwidth: 750px

Bayesian fit with five observations
```

````

`````

The figures on the top row should look familiar. We again start with an uninformed bag of models and they evolve to a more constrained version as more data is observed. The bottom row shows some new insights. On the left we see the actual probability distribution $p(w)$, either prior or posterior. 

On the right we see a plot of the likelihood function $p(t\vert w)$. Recall this is a distribution on $t$, and therefore **not** on $w$! When plotting it against $w$ (on which it does depend), we call it a **likelihood function** instead of a probability distribution to make the distinction clear. The likelihood function measures how probable the data point currently being assimilated would be under different values of $w$. It therefore provides a push towards certain values of $w$, which is weighed against the current prior $p(w)$.

Note how observing the first point has little effect on the posterior: it is so close to the origin that the likelihood function becomes quite spread out and moves the posterior very little. We can read this as *"our current observation might just as well be explained by observation noise (with variance $\beta^{-1}$), regardless of what $w$ is"*. Observing subsequent points gradually moves the posterior towards the ground-truth value $w_\mathrm{true}=1$, and the distribution becomes more sharply peaked, as we would expect.
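
The spread of the likelihood function can be made explicit by completing the square in $w$ (assuming $x\neq 0$):

$$
p(t\vert w,\beta) \propto \exp\left(-\frac{\beta}{2}\left(t-wx\right)^2\right) = \exp\left(-\frac{\beta x^2}{2}\left(w-\frac{t}{x}\right)^2\right)
$$

Seen as a function of $w$, this is a Gaussian-shaped curve centered at $t/x$ with standard deviation $1/(\vert x\vert\sqrt{\beta})$, so observations with small $\vert x\vert$ constrain $w$ only weakly.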

In the limit of infinitely many observations:

$$
N\to\infty\quad\Rightarrow\quad m_\infty\to w_\mathrm{MLE},\quad S_\infty\to 0
$$(infinitedatalimit)

This is a very satisfying result: when the evidence is absolutely overwhelming, our prior beliefs should be completely discarded and we should rely purely on what the data says.
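
This limit can be checked numerically for the single-weight model. The sketch below is a rough illustration under assumed settings (the random seed, the sample sizes and the uniform inputs are choices made here). It compares the posterior mean with the maximum-likelihood weight $w_\mathrm{MLE}=\sum_n x_n t_n/\sum_n x_n^2$ and prints the posterior variance for growing $N$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, w_true = 100.0, 40.0, 1.0

for n in [5, 50, 5000]:
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    t = w_true * x + rng.normal(0.0, np.sqrt(1.0 / beta), n)

    w_mle = (x @ t) / (x @ x)        # least-squares (maximum-likelihood) weight
    prec = alpha + beta * (x @ x)    # posterior precision after n points
    m = beta * (x @ t) / prec        # posterior mean after n points
    print(f"N={n:5d}  |m - w_mle|={abs(m - w_mle):.2e}  S={1.0 / prec:.2e}")
```

Both printed quantities should shrink as $N$ grows, since the data term $\beta\sum_n x_n^2$ eventually dominates the prior precision $\alpha$.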

```{admonition} Further Reading    
:class: tip    
You can now finish reading Section 3.3.1. Figure 3.7 contains a two-dimensional version of the example above which you can relate to what you have seen here.
+++                         
{bdg-danger}`bishop-prml`     
``` 