Introduction and k-Nearest Neighbors#

# pip install packages that are not in Pyodide
%pip install ipympl==0.9.3
%pip install seaborn==0.12.2

# Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from mude_tools import magicplotter
from cycler import cycler
import seaborn as sns

%matplotlib widget

# Set the color scheme
sns.set_theme()
colors = [
    "#0076C2",
    "#EC6842",
    "#A50034",
    "#009B77",
    "#FFB81C",
    "#E03C31",
    "#6CC24A",
    "#EF60A3",
    "#0C2340",
    "#00B8C8",
    "#6F1D77",
]
plt.rcParams["axes.prop_cycle"] = cycler(color=colors)

Introduction#

A very general and frequently encountered problem is predicting some target variable \(t\) as a function of a predictor variable \(x\) when only inaccurate measurements of the target variable are available. Some examples of problems where this kind of framework can be applied are given below:

| Problem | Target variable \(t\) | Predictor variable \(x\) |
|---|---|---|
| Structural health monitoring of a bridge | Displacement | Location along the bridge |
| Fatigue testing of a steel specimen | Stress | Number of loading cycles |
| Flood control of the river Maas | Volumetric flow rate | Precipitation in the Ardennes |
| Radioactive decay of a radon-222 sample | Number of \(\alpha\)-particles emitted | Time |
| Cooling rate of my coffee cup | Temperature | Time |

To strip the problem down to its bare essentials, we will use a toy problem, where we actually know the ground truth \(f(x)\). In our case, this will be a simple sine wave, of which we make 100 noisy measurements from \(x=0\) to \(x=2 \pi\). Note that, in a practical setting, we would only have access to these noisy measurements and not to the true function that generated the data. Finding a good estimate of \(f(x)\) based on this contaminated data is one of our main objectives.

# The true function relating t to x
def f_truth(x, freq=1, **kwargs):
    # Return a sine with a frequency of freq
    return np.sin(x * freq)


# The data generation function
def f_data(epsilon=0.7, N=100, **kwargs):
    # Apply a seed if one is given
    if "seed" in kwargs:
        np.random.seed(kwargs["seed"])

    # Get the minimum and maximum
    xmin = kwargs.get("xmin", 0)
    xmax = kwargs.get("xmax", 2 * np.pi)

    # Generate N evenly spaced observation locations
    x = np.linspace(xmin, xmax, N)

    # Generate N noisy observations (1 at each location)
    t = f_truth(x, **kwargs) + np.random.normal(0, epsilon, N)

    # Return both the locations and the observations
    return x, t


# Get the observed data
x, t = f_data()

# Plot the data and the ground truth
fig = plt.figure(figsize=(6, 4))
fig.canvas.toolbar_visible = False
plt.plot(x, f_truth(x), "k-", label=r"Ground truth $f(x)$")
plt.plot(x, t, "x", label=r"Noisy data $(x,t)$")
plt.xlabel("x")
plt.ylabel("t")
plt.legend()
plt.show()

k-nearest neighbors#

A (perhaps naive) approach to finding \(y(x)\) (i.e. the approximation of \(f(x)\)) would be to simply look at the data points surrounding \(x\) and average their observed targets to get an estimate of \(f(x)\). This approach is called k-nearest neighbors, where \(k\) refers to the number of surrounding points we look at.
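
To make this averaging idea concrete, here is a minimal from-scratch sketch (the helper name knn_average and its arguments are ours, for illustration only; the actual fits below rely on scikit-learn):

# Minimal from-scratch sketch of the kNN average (illustration only)
def knn_average(x_train, t_train, x_query, k=5):
    # Distance from the query location to every observation location
    distances = np.abs(x_train - x_query)

    # Indices of the k closest observations
    nearest = np.argsort(distances)[:k]

    # The estimate is simply the mean of their observed targets
    return np.mean(t_train[nearest])


# For example, estimate f(pi/2) from the 5 nearest noisy observations
print(knn_average(x, t, np.pi / 2, k=5))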

Implementing this efficiently is not trivial, but thankfully we can leverage existing implementations. We will use the KNeighborsRegressor class from the sklearn.neighbors module to fit our data and obtain \(y(x)\) at the locations where we want predictions.

# Define the prediction locations
# (note that these are different from the locations where we observed our data)
x_pred = np.linspace(0, 2 * np.pi, 1000)


# Define a function that makes a KNN prediction at the given locations, based on the given (x,t) data
def KNN(x, t, x_pred, k=1, **kwargs):
    # Convert x and x_pred to column vectors so KNeighborsRegressor can work with them
    X = x.reshape(-1, 1)
    X_pred = x_pred.reshape(-1, 1)

    # Train the KNN based on the given (x,t) data
    neigh = KNeighborsRegressor(n_neighbors=k)
    neigh.fit(X, t)

    # Make a prediction at the locations given by x_pred
    y = neigh.predict(X_pred)

    # Check if the regressor itself should be returned
    if kwargs.get("return_regressor", False):
        # If so, return the fitted KNN regressor
        return neigh

    else:
        # If not, return the predicted values
        return y
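
As a quick sanity check (using the data and prediction locations defined above), the KNN function can also be called directly, outside of any plotting helper:

# Predict with 5 neighbors at the prediction locations
y_example = KNN(x, t, x_pred, k=5)
print(y_example.shape)  # one prediction per entry of x_pred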

We visualize the predictions with the help of the magicplotter object, which wraps matplotlib and significantly reduces the amount of code on this page. You do not need to understand how it works, but you can take a look at it in the course repository.

# Plot the resulting predictions
fig, ax = plt.subplots(2, 2, figsize=(9, 6), sharex="all", sharey="all")
fig.canvas.toolbar_visible = False
plt.suptitle(r"k-nearest neighbors regression for different values of $k$")

# Plot for k=1
magicplotter(
    f_data, f_truth, KNN, x_pred, k=1, ax=ax[0][0], hide_legend=True, title=r"$k={k}$"
)

# Plot for k=10
magicplotter(
    f_data, f_truth, KNN, x_pred, k=10, ax=ax[0][1], hide_legend=True, title=r"$k={k}$"
)

# Plot for k=30
magicplotter(
    f_data, f_truth, KNN, x_pred, k=30, ax=ax[1][0], hide_legend=True, title=r"$k={k}$"
)

# Plot for k=100
magicplotter(
    f_data, f_truth, KNN, x_pred, k=100, ax=ax[1][1], hide_legend=True, title=r"$k={k}$"
)

# Add a general legend at the bottom of the plot
plt.subplots_adjust(bottom=0.2)
handles, labels = ax[0][0].get_legend_handles_labels()
fig.legend(handles, labels, loc="lower center")

plt.show()

Looking at the previous plots, a few questions might pop up:

  • For \(k=1\), we see that our prediction matches the observed data exactly in all data points. Is this desirable?

  • For \(k=30\), what is going on around \(x=0\) and \(x=2 \pi\)?

  • For \(k=100\), why is our prediction constant with respect to \(x\)?

Varying our model parameters#

Clearly, some value of \(k\) between 1 and 100 would give us the best predictions. Using the script below, you can generate a plot where the following variables can be adjusted:

  • \(N\), the size of the training data set

  • \(\varepsilon\), the level of noise associated with the data

  • \(k\), the number of neighbors over which the average is taken

  • The random seed, which can be changed to generate new random data sets

  • The oscillation frequency of the underlying ground truth, which controls how nonlinear the function we are trying to approximate is

  • A probing location \(x_0\), allowing you to see which neighbors are used to compute the average response \(y(x_0)\) of the kNN estimator

plot1 = magicplotter(f_data, f_truth, KNN, x_pred)
plot1.fig.canvas.toolbar_visible = False
plot1.add_sliders("epsilon", "k", "N", "freq")
plot1.add_buttons("truth", "seed", "reset")
plot1.add_probe()
plot1.show()

Playing around with the plots#

By visual inspection, use the slider for \(k\) to find its optimal value. The following questions might be interesting to ask yourself:

  • If the training size \(N\) increases/decreases, how does this affect my optimal value of \(k\)?

  • If the frequency increases/decreases, how does this affect my optimal value of \(k\)?

  • If my measurements are less/more noisy, how does this affect my optimal value of \(k\)?

  • If I generate new data by changing the seed, how is my prediction affected for small values of \(k\)? What about large values of \(k\)?

  • So far, all observations were distributed uniformly over \(x\). How would our predictions change if the observed data were more clustered? (A short sketch for trying this out follows this list.)

  • If I do not know the truth, how do I figure out what my value of \(k\) should be?
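
To experiment with the clustering question above, the short sketch below generates observations that are denser near \(x=0\) (the quadratic spacing used here is just one arbitrary way to create clustering) and plots the resulting kNN prediction:

# Sketch: clustered observation locations (denser near x = 0), using an assumed spacing scheme
np.random.seed(0)
x_clustered = (np.linspace(0, 1, 100) ** 2) * 2 * np.pi
t_clustered = f_truth(x_clustered) + np.random.normal(0, 0.7, 100)

# Fit kNN on the clustered data and predict on the regular grid
y_clustered = KNN(x_clustered, t_clustered, x_pred, k=10)

fig = plt.figure(figsize=(6, 4))
fig.canvas.toolbar_visible = False
plt.plot(x_pred, f_truth(x_pred), "k-", label=r"Ground truth $f(x)$")
plt.plot(x_clustered, t_clustered, "x", label="Clustered noisy data")
plt.plot(x_pred, y_clustered, label=r"kNN prediction ($k=10$)")
plt.xlabel("x")
plt.ylabel("t")
plt.legend()
plt.show()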

Final remarks#

So far, we have looked at our k-nearest neighbors regressor mostly qualitatively. However, it is possible to apply a more quantitative framework to our model and find the optimal value for \(k\) in a more structured way. The following page will discuss how this is done.
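
As a small preview of what such a quantitative comparison could look like (possible here only because we know the ground truth of the toy problem; the data-driven procedure is the subject of the next page), one could compare the kNN prediction to \(f(x)\) for several values of \(k\):

# Sketch: mean squared error w.r.t. the known ground truth of the toy problem
# (only possible because f_truth is available; see the next page for a proper, data-driven approach)
for k in [1, 5, 10, 30, 100]:
    y = KNN(x, t, x_pred, k=k)
    mse = np.mean((y - f_truth(x_pred)) ** 2)
    print(f"k = {k:3d}: MSE w.r.t. ground truth = {mse:.3f}")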