From Least Squares to k-Nearest Neighbor (kNN)

The linear model is one of the most widely used, and most important, tools in data science.  In contrast, there is another basic tool: the k-nearest neighbor (kNN) method.  Both models can be used for prediction and classification.  In practice, classification results (i.e., predicted feature classes) are used by machines in many ways: to recognize faces in a crowd, to “read” road signs by distinguishing one letter from another, and to set voter registration districts by separating population groups.  This article applies and compares linear and non-linear classification methods.

Introduction by Comparison

A side-by-side comparison of methods reveals their nature:

  • The linear model fit by least squares makes strict assumptions about the structure of the model and the underlying data.  A closed-form matrix equation solves linear regression.  The resulting model fit is typically stable, but predictions tend to be inflexible in shape and possibly inaccurate;
  • The k-nearest neighbor method makes only limited assumptions about the model and data.  Iterative search procedures solve nearest neighbor problems.  The resulting model fit is very flexible in shape, and predictions are usually accurate but prone to instability.
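The computational contrast in the bullets above can be sketched in a few lines.  This is a minimal illustration assuming NumPy, with illustrative synthetic data (the variable names are not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # illustrative 2-D inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # illustrative 0/1 responses

# Least squares: one closed-form solve of the normal equations
# (X^T X) beta = X^T y -- no iteration, a single fixed global fit.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Nearest neighbors: no global fit at all; every prediction is a
# fresh search through the training set for the k closest points.
def knn_predict(x, k=25):
    d2 = np.sum((X - x) ** 2, axis=1)      # squared Euclidean distances
    return y[np.argsort(d2)[:k]].mean()    # local average of k neighbors
```

The least squares step produces a small fixed parameter vector; the kNN "fit" is deferred entirely to prediction time, which is where its flexibility and its instability both come from.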

Least-Squares Classification

For example, Figure 1 shows a scatter plot of training data on the input pair X_{1} and X_{2}, each of which has 100 points. The random data define the output class variable G, which sets the color of the background grid with the values \textbf{\textcolor{RoyalBlue}{BLUE}} and \textbf{\textcolor{Orange}{ORANGE}}.

First, the output values of Y are coded as 0 for \textbf{\textcolor{RoyalBlue}{BLUE}} data points and 1 for \textbf{\textcolor{Orange}{ORANGE}} data points.  Next, a linear least squares model is fit to the training data, treating Y as the response variable, and used to determine the class of each point on the background grid.  Finally, the fitted values \hat{Y} define the predicted class \hat{G} given the rule:

(1)   \begin{equation*} \hat{G} = \begin{cases} \mathbf{\textcolor{Orange}{ORANGE}} & \text{if} ~ \hat{Y} > 0.5, \\ \mathbf{\textcolor{RoyalBlue}{BLUE}} & \text{if} ~ \hat{Y} \leq 0.5. \end{cases} \end{equation*}

Hence, the linear model for the two-dimensional classification problem, in matrix notation, is:

(2)   \begin{equation*} \mathbf{y} = \mathbf{X}\beta + \mathbf{e} \end{equation*}

with the intercept included as a leading column of 1s in \mathbf{X}, and y_i equal to 0 or 1 depending on the class. The fitted coefficients \hat{\beta} define the linear decision boundary given by \mathbf{x}^T \hat{\beta} = 0.5. In summary, the background grid in Figure 1 is shaded according to the classification defined by (1).
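The 0/1 coding, the least squares fit, and decision rule (1) can be sketched together.  This is a minimal illustration assuming NumPy, with synthetic data standing in for the article's training set (it is not the actual data behind Figure 1):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                  # inputs X1, X2
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # Y coded 0 (BLUE) / 1 (ORANGE)

Xb = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # closed-form least squares fit

y_hat = Xb @ beta                              # fitted values Y-hat
g_hat = np.where(y_hat > 0.5, "ORANGE", "BLUE")  # decision rule (1)
```

Applying the same rule to every point of a background grid, rather than to the training inputs, is what produces the shaded regions and the straight boundary in Figure 1.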

It's clear, however, that there are several misclassifications in the data on both sides of the linear decision boundary in Figure 1. The decision boundary is too inflexible.  More likely, the optimal decision boundary is non-linear and may be disjoint in nature.

k-Nearest Neighbor Method

In contrast, the k-nearest neighbor method uses the k observations in the training set \mathcal{T} closest to each point x on the background grid to form \hat{Y}. The kNN fit for \hat{Y} is defined by:

(3)   \begin{equation*} \hat{Y} = \frac{1}{k}\sum_{x_i \in N_k (x)} y_i \end{equation*}

where N_k(x) is the neighborhood of x containing the k closest training points x_i in \mathcal{T}. Closeness implies a metric, which for simplicity is taken to be Euclidean distance. So, in words, for each grid point x we find the k training observations x_i closest to x in the input space, and average their responses.
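Equation (3) translates almost directly into code.  A minimal sketch assuming NumPy, with illustrative names and synthetic data:

```python
import numpy as np

def knn_fit(x, X_train, y_train, k):
    """Average the responses of the k training points closest to x -- eq. (3)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(d2)[:k]              # indices of the neighborhood N_k(x)
    return y_train[nearest].mean()            # average response over N_k(x)

rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 2))           # illustrative training inputs
y_train = (X_train[:, 0] > 0).astype(float)   # illustrative 0/1 responses

y_hat = knn_fit(np.zeros(2), X_train, y_train, k=25)
g_hat = "ORANGE" if y_hat > 0.5 else "BLUE"   # same decision rule (1) as before
```

Note that \hat{Y} is a proportion between 0 and 1 here, so thresholding it at 0.5 amounts to a majority vote among the k neighbors.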

In Figure 2, the training data are the same as in Figure 1.  However, the feature space is now classified using the average class of the 25 nearest neighbors, with equation (3) as the method of fitting. It is now obvious that the decision boundary separating the \textbf{\textcolor{RoyalBlue}{BLUE}} region from the \textbf{\textcolor{Orange}{ORANGE}} region is far more irregular, and responds to local clusters and outliers where one class dominates.

Figure 2 – kNN classification with k = 25

As a result, in Figure 2, with k = 25, far fewer training observations are misclassified than in Figure 1. This doesn’t give much comfort, though, since the classification boundary is still prone to errors.

Figure 3 – kNN classification with k = 8

In Figure 3, with k = 8, we now see fewer errors.  The classification boundary also forms disjoint islands that help separate the data.

Figure 4 – kNN classification with k = 1

In Figure 4, with k = 1, none of the training data are misclassified. A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be approximately an increasing function of k, and will always be 0 for k = 1.
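This claim is easy to check numerically: with k = 1, every training point is its own nearest neighbor (distance zero), so it is always assigned its own class.  A quick sketch assuming NumPy, on illustrative noisy data rather than the article's dataset:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))                                # illustrative inputs
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(float)  # noisy labels

def training_error(k):
    """Fraction of training points misclassified by kNN with this k."""
    errors = 0
    for i in range(len(X)):
        d2 = np.sum((X - X[i]) ** 2, axis=1)        # distances to point i
        g_hat = y[np.argsort(d2)[:k]].mean() > 0.5  # kNN class at X[i]
        errors += int(g_hat != (y[i] == 1.0))
    return errors / len(X)

for k in (1, 8, 25):
    print(k, training_error(k))   # error is exactly 0.0 at k = 1
```

The point X[i] itself always appears in its own neighborhood here, which is exactly why training error is a misleading guide, as discussed below.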

Final Comparison and Takeaways

The k-nearest-neighbor fit has a single parameter, the number of neighbors k, whereas linear least squares relies on p, the number of model parameters.  The effective number of parameters for k-nearest neighbors is N/k, which is generally much bigger than p and decreases with increasing k. To get an idea of why, note that if the neighborhoods were non-overlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood. For example, with N = 100 training points, k = 25 gives 100/25 = 4 effective parameters, while k = 1 gives 100.

Finally, unlike linear least squares, the sum of squared errors on the training set has no merit as a criterion for picking k in the nearest neighbor method, since it would always pick k = 1 as the best answer.  And a solution with such low training error has a decision boundary that is unstable, complex and noisy.

This entry was posted in Data Science, Modeling, R Programming, Website.