The linear model is one of the most widely used data science tools and one of the most important. In contrast, there is another basic tool: the nearest neighbor method (NN). Prediction and classification are two uses for these models. In practice, classification results (ie. feature classes) are used by machines in many ways: to recognize faces in a crowd, to “read” road signs by distinguishing one letter from another and to set voter registration districts by separating population groups. This article applies and compares linear and non-linear classification methods

**Introduction by Comparison**

**Introduction by Comparison**

A side-by-side comparison of methods reveals their nature:

- The linear model fit using least squares makes strict assumptions about the structure of the model and the underlying data. A closed form matrix equation solves linear regression. The resulting model fit is typically stable, and predications tend to be inflexible in shape and possibly inaccurate;
- The nearest neighbor method makes only limited assumptions about the model and data. Iterative search functions solve nearest neighbor problems. The resulting model fit is very flexible in shape, predictions are usually accurate, but prone to instability.

**Least-Squares Classification**

**Least-Squares Classification**

For example, Figure 1 shows a scatter plot of training data on the input pair and , each of which has a 100 points. The random data can be used to define the output class variable , which sets the color of the background grid with the values and .

First, a linear least squares model applied the training data to define the response variable and to determine the class of the background grid space. Secondly, the input values of are coded as for data points and for data points. Finally, the fitted values define the class variable given the rule:

(1)

Hence, the two-dimensional classification problem in matrix notation is:

(2)

with no intercept term and equal to or depending on the class. The regression line defines the decision boundary given by . In summary, the background grid is shaded in Figure 1 given the classification defined by ().

Its clear, however, that there are several mis-classifications in the data on both sides of the linear decision boundary in Figure 1. As a result, the decision boundary is too inflexible. More likely, the optimal decision boundary is non-linear and maybe disjoint in nature.

**-Nearest Neighbor Method**

In contrast, the nearest neighbor method uses the observations in the training set closest to the point on the background space grid to form . The NN fit for is defined by:

**-Nearest Neighbor Method**

(3)

where is the neighborhood of with the closest points in . Closeness implies a metric, which for simplicity is Euclidean distance. So, in words, we find the observations for the feature grid point closest to in the input space, and average their responses.

In Figure 2, the training data is the same as that in Figure 1. However, the feature space is now classified using the average class of the 25-nearest-neighbors given equation () as the method of fitting. It is now obvious that the decision boundary that separates the from the region is far more irregular, and responds to local clusters and outliers where one class dominates.

Asa result, in Figure 2, with , there are far fewer training observations that are misclassified than in Figure 1. This doesn’t give much comfort, though, since the classification boundary is still prone to errors.

in Figure 3, with , we now see less errors. The classification boundary also has disjoint islands to help separate the data.

In figure 4, *none* of the training data are misclassified with . A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be an increasing function of , and will always be 0 for .

**Final Comparison and Takeaways**

**Final Comparison and Takeaways**

The -nearest-neighbor fits have a single parameter, the number of neighbors, . Linear least squares relies on , the number of model parameters. The number of parameters for -nearest neighbors is . This is generally much bigger than , and decreases with increasing . To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be neighborhoods and we would fit one parameter (a mean) in each neighborhood.

Finally, unlike linear least squares, the sum-of-squared errors on the training set has no merit as a criterion for picking in the nearest neighbor method, since we would always pick as the best answer. However, the boundary for any solution with a low value is unstable, complex and noisy.