Your very first models

This notebook contains three interactive exercises on k-nearest neighbors (kNN) models, underfitting, and overfitting.

# pip install packages that are not in Pyodide
%pip install ipympl==0.9.3
%pip install seaborn==0.12.2

# Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from mude_tools import magicplotter
from cycler import cycler
import seaborn as sns

%matplotlib widget

# Set the color scheme
sns.set_theme()
colors = [
    "#0076C2",
    "#EC6842",
    "#A50034",
    "#009B77",
    "#FFB81C",
    "#E03C31",
    "#6CC24A",
    "#EF60A3",
    "#0C2340",
    "#00B8C8",
    "#6F1D77",
]
plt.rcParams["axes.prop_cycle"] = cycler(color=colors)
# The true function relating t to x
def f_truth(x, freq=1, **kwargs):
    # Return a sine with angular frequency freq, superimposed on the linear trend x
    return np.sin(x * freq) + x
    # return 3. * np.exp(x) / (np.exp(x)+1)


# The data generation function
def f_data(epsilon=0.7, N=100, **kwargs):
    # Apply a seed if one is given
    if "seed" in kwargs:
        np.random.seed(kwargs["seed"])

    # Get the minimum and maximum
    xmin = kwargs.get("xmin", 0)
    xmax = kwargs.get("xmax", 2 * np.pi)

    # Generate N evenly spaced observation locations
    x = np.linspace(xmin, xmax, N)

    # Generate N noisy observations (1 at each location)
    t = f_truth(x, **kwargs) + np.random.normal(0, epsilon, N)

    # Return both the locations and the observations
    return x, t


# Get the observed data
x, t = f_data()
# Define the prediction locations
# (note that these are different from the locations where we observed our data)
x_pred = np.linspace(0, 2 * np.pi, 1000)


# Define a function that makes a KNN prediction at the given locations, based on the given (x,t) data
def KNN(x, t, x_pred, k=1, **kwargs):
    # Convert x and x_pred to column vectors, as required by KNeighborsRegressor
    X = x.reshape(-1, 1)
    X_pred = x_pred.reshape(-1, 1)

    # Train the KNN based on the given (x,t) data
    neigh = KNeighborsRegressor(n_neighbors=k)
    neigh.fit(X, t)

    # Check if the regressor itself should be returned
    if kwargs.get("return_regressor", False):
        # If so, return the fitted KNN regressor
        return neigh

    # Otherwise, return the predictions at the locations given by x_pred
    return neigh.predict(X_pred)
def get_plot1():
    plot1 = magicplotter(f_data, f_truth, KNN, x_pred, freq=3.0, epsilon=1.0)
    plot1.fig.canvas.toolbar_visible = False
    plot1.add_slider("k", valmin=1, valmax=50, valinit=50)
    # plot1.add_slider("N", valmin=1, valmax=100, valinit=70)
    plot1.ax.lines[0].remove()
    plot1.show()
    return plot1


def get_plot2():
    plot2 = magicplotter(f_data, f_truth, KNN, x_pred, freq=3.0, epsilon=1.0)
    plot2.fig.canvas.toolbar_visible = False
    plot2.add_slider("k", valmin=1, valmax=50, valinit=1)
    plot2.show()
    return plot2


def get_plot3():
    plot3 = magicplotter(f_data, f_truth, KNN, x_pred, freq=3.0, epsilon=1.0)
    plot3.fig.canvas.toolbar_visible = False
    plot3.add_slider("k", valmin=1, valmax=50, valinit=1)
    plot3.add_sidebar()
    plot3.ax_mse.set_ylim((-0.05, 2.0))
    plot3.ax.lines[0].remove()
    plot3.show()
    return plot3

Training a first model

Play with the interactive model below: adjust \(k\) with the slider and try to build the best possible model given our \(N\) data points. Remember, we are trying to minimize the error when making predictions!

plot1 = get_plot1()
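
To make the role of \(k\) concrete, here is a minimal sketch of what a kNN regressor computes, written with NumPy only. The helper `knn_predict` and the `x_demo`/`t_demo` arrays are hypothetical names introduced for this illustration; the data mimic `f_data` with `freq=3.0` and `epsilon=1.0`, but use their own random seed, so the numbers will not match the plot.

```python
import numpy as np


def knn_predict(x_train, t_train, x_query, k):
    """Hypothetical helper: average the targets of the k nearest training points."""
    # Pairwise distances between query and training locations, shape (n_query, n_train)
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k closest training points for each query location
    nearest = np.argsort(dist, axis=1)[:, :k]
    # The prediction is the mean of the selected targets
    return t_train[nearest].mean(axis=1)


# Illustrative data (assumption: same functional form as f_truth with freq=3.0)
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 2 * np.pi, 100)
t_demo = np.sin(3.0 * x_demo) + x_demo + rng.normal(0, 1.0, x_demo.size)

# Error measured on the same points the model was built from
for k in (1, 10, 100):
    y_train = knn_predict(x_demo, t_demo, x_demo, k)
    train_mse = np.mean((y_train - t_demo) ** 2)
    print(f"k={k:3d}  training MSE = {train_mse:.3f}")
```

With \(k=1\), every training point is its own nearest neighbor, so the training error is exactly zero even though the data are noisy. With \(k=N\), the prediction collapses to the mean of all observations. Neither extreme says much about how well the model predicts at new locations.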

Training a model while knowing the ground truth

We repeat the previous exercise, but now the plot also shows the exact function we are trying to approximate (the ground truth). Try it out below. Did your choice of \(k\) change?

plot2 = get_plot2()
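
When the ground truth is available, the visual comparison in the plot above can also be expressed as a number. The sketch below is one way to do that, assuming the same functional form as `f_truth` with `freq=3.0`; the names `truth`, `x_obs`, `t_obs`, `x_grid` and `errors` are introduced here for illustration, and the data use their own random seed.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def truth(x):
    # Assumed ground truth: same form as f_truth with freq=3.0
    return np.sin(3.0 * x) + x


# Noisy observations and a dense grid of prediction locations
rng = np.random.default_rng(0)
x_obs = np.linspace(0, 2 * np.pi, 100)
t_obs = truth(x_obs) + rng.normal(0, 1.0, x_obs.size)
x_grid = np.linspace(0, 2 * np.pi, 1000)

# Mean squared difference between each kNN prediction and the ground truth
errors = {}
for k in (1, 5, 10, 25, 50):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(x_obs.reshape(-1, 1), t_obs)
    y_grid = model.predict(x_grid.reshape(-1, 1))
    errors[k] = np.mean((y_grid - truth(x_grid)) ** 2)
    print(f"k={k:2d}  MSE against ground truth = {errors[k]:.3f}")
```

This measure is only available in a synthetic setting like this one, where the function that generated the data is known.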

Training a good model under limited knowledge

Of course, in practice the ground truth is unknown! We only have our data to work with, and nothing else. Below, we use 80% of our dataset for training and 20% for validation. Try to build a model with the lowest possible validation error, then compare it with the two models you trained above.

plot3 = get_plot3()
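
The 80/20 procedure described above can be sketched outside the interactive plot as follows. This is an illustration under stated assumptions, not necessarily how `magicplotter` splits the data internally: it uses scikit-learn's `train_test_split` with a random split, and the names `x_all`, `t_all`, `val_mse` and `best_k` are introduced here.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Illustrative data (assumption: same functional form as f_truth with freq=3.0)
rng = np.random.default_rng(0)
x_all = np.linspace(0, 2 * np.pi, 100)
t_all = np.sin(3.0 * x_all) + x_all + rng.normal(0, 1.0, x_all.size)

# Hold out 20% of the observations for validation
X_train, X_val, t_train, t_val = train_test_split(
    x_all.reshape(-1, 1), t_all, test_size=0.2, random_state=0
)

# Fit on the training part only and score each k on the held-out part
val_mse = {}
for k in range(1, 51):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, t_train)
    val_mse[k] = mean_squared_error(t_val, model.predict(X_val))

# Pick the k with the lowest validation error
best_k = min(val_mse, key=val_mse.get)
print(f"best k = {best_k}, validation MSE = {val_mse[best_k]:.3f}")
```

Because the validation points are never used for fitting, their error is an estimate of how the model behaves at unseen locations. The selected \(k\) depends on the particular split and noise realization, so a different seed may give a different answer.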