Using the k-nearest neighbor machine learning algorithm for classification, what do larger values of k do?

KNN falls in the supervised learning family of algorithms. K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure. Classification is done by a majority vote of a point's neighbors: the data point is assigned to the class that is most common among its nearest neighbors. As you increase the number of nearest neighbors, the value of k, accuracy might increase.

KNN is a non-parametric, lazy learning algorithm. When we say a technique is non-parametric, it means that it does not make any assumptions about the underlying data distribution.

In the KNN algorithm, there is no explicit training phase, or it is very minimal.

Let’s see how the K Nearest Neighbor algorithm works:

Step 1: Choose the number K of neighbors that the algorithm will use. Here, ‘K’ in KNN is the number of nearest neighbors used to classify a test sample (or to predict its value in the case of a continuous variable, i.e. regression). One of the most common default values for K is 5.

Step 2: Take the K nearest neighbors of the new data point, according to the Euclidean distance. We can also use any other distance measure, such as the Manhattan, Chebyshev, or Hamming distance.

Euclidean distance: the Euclidean distance, or Euclidean metric, is the “ordinary” straight-line distance between two points in Euclidean space.

Euclidean distance formula: d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )

Step 3: Among these K neighbors, count the number of data points in each category, i.e. how many of them fall into each category.

When the value of K is small, we force the model to base its prediction on very few examples, which makes it sensitive to noise. A small value of K provides the most flexible fit. On the other hand, a larger value of K increases the number of voters in each prediction and is therefore more resilient to outliers.

Step 4: Assign the new data point to the category in which we counted the most neighbors.

In KNN, the model structure is determined from the data. KNN is also a lazy algorithm, which means that it does not use the training data points to do any generalization.
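To make the four steps concrete, here is a minimal sketch in plain Python. The function names, variable names, and toy data points below are purely illustrative and not taken from any particular library:

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two numeric points (helper for Step 2)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training_points, training_labels, new_point, k=5):
    """Classify new_point by a majority vote among its k nearest neighbors."""
    # Step 1: k is given as a parameter (5 is a common default).
    # Step 2: measure the distance from the new point to every stored point.
    distances = [
        (euclidean_distance(new_point, point), label)
        for point, label in zip(training_points, training_labels)
    ]
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count how many of the k neighbors fall into each category.
    votes = Counter(label for _, label in k_nearest)
    # Step 4: assign the new point to the category with the most neighbors.
    return votes.most_common(1)[0][0]

# Toy data: two features per point, two classes "A" and "B".
X = [(1.0, 1.2), (0.8, 1.0), (5.0, 5.5), (5.2, 4.8), (4.9, 5.1)]
y = ["A", "A", "B", "B", "B"]
print(knn_classify(X, y, new_point=(5.0, 5.0), k=3))  # prints B
```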

Application

The implementation is very simple because there is no explicit training phase, or it is very minimal. KNN can outperform more powerful classifiers and is used in a variety of applications such as economic forecasting, data compression, and genetics.

One of the most common and widely used applications of the K-NN algorithm is recommender systems.

That’s all for K Nearest Neighbor Machine learning Algorithm. Stay tuned for further blogs.

Thank you.

K-nearest neighbors is one of the simplest machine learning algorithms. As for many others, human reasoning was the inspiration for this one as well.

Whenever something significant happens in your life, you memorize the experience. You later use this experience as a guideline for what you expect to happen next.

Imagine you see somebody dropping a glass. While the glass falls, you already predict that it will break when it hits the ground. But how can you do this? You have never seen this particular glass break before, right?

No, indeed not. But you have seen similar glasses, or similar items in general, dropping to the floor before. And while the situation might not be exactly the same, you still know that a glass dropping from about 5 feet onto a concrete floor usually breaks. This gives you a pretty high level of confidence to expect breakage whenever you see a glass fall from that height onto a hard floor.

But what about dropping a glass from a height of one foot onto a soft carpet? Have you experienced glasses breaking in such situations as well? No, you have not. So we can see that the height matters, and so does the hardness of the ground.

This way of reasoning is what a k-nearest neighbors algorithm is doing as well.

What exactly is the k-nearest neighbors algorithm?

Whenever a new situation occurs, it scans through all past experiences and looks up the closest experiences. Those experiences (or data points) are what we call the k-nearest neighbors.

Take a classification task, for example. You want to predict if the glass breaks or not, so you take the majority vote of all k-neighbors. If k=5 and in 3 or more of your most similar experiences the glass broke, you go with the prediction “yes, it will break”.

Let’s now assume that you want to predict the number of pieces a glass will break into. In this case we want to predict a number, which is what we call “regression”.

Now, you take the average of your k-neighbors’ numbers of glass pieces as the prediction or score. If k=5 and the numbers of pieces are 1 (did not break), 4, 8, 2, and 10, you end up with a prediction of 5.
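To make the difference between the two cases concrete, here is a tiny Python sketch. The piece counts are the ones from the paragraph above; the yes/no votes and variable names are illustrative:

```python
from collections import Counter

# Outcomes of the 5 most similar past experiences (k = 5).
neighbor_broke = ["yes", "yes", "no", "yes", "no"]  # did the glass break?
neighbor_pieces = [1, 4, 8, 2, 10]                  # number of pieces

# Classification: majority vote of the k neighbors -> "yes"
prediction_break = Counter(neighbor_broke).most_common(1)[0][0]

# Regression: average of the k neighbors' numeric values -> 5.0
prediction_pieces = sum(neighbor_pieces) / len(neighbor_pieces)

print(prediction_break, prediction_pieces)
```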

[Figure: knn_concept — nearest neighbors of a new data point]

We have blue and orange data points. For a new data point (green), we can determine the most likely class by looking up the classes of the nearest neighbors. Here, the decision would be “blue” because that is the majority of the neighbors.

Why is the k-nearest neighbors algorithm called “lazy”?

Because it does no training at all when you supply the training data. At training time, all it is doing is storing the complete data set but it does not do any calculations at this point. Neither does it try to derive a more compact model from the data which it could use for scoring. Therefore, we call this algorithm lazy.

We have seen that this algorithm is lazy: during training, all it does is store the data it gets. All the computation happens during scoring, i.e. when we apply the model to unseen data points. We need to determine which k data points from our training set are closest to the data point we want a prediction for.

Let’s say that our data points look like the following:

[Figure: data_set_math — training data as a table of n rows; columns x1 … xm are attributes, y is the label]

We have a table of n rows and m+1 columns where the first m columns are the attributes we use to predict the remaining label column (also known as “target”). For now, let’s also assume that all attribute values x are numerical while the label values for y are categorical, i.e. we have a classification problem.

We can now define a distance function which calculates the distance between data points. In particular, it should find the closest data points from our training data for any new point.

The Euclidean distance is often a good choice for such a distance function if the data is numerical. If our new data point has attribute values s1 to sm, we can calculate the distance d(s, xj) between point s and any data point xj by

d(s, xj) = sqrt( (s1 − xj1)² + (s2 − xj2)² + … + (sm − xjm)² )

The k data points with the smallest values for this distance become our k-neighbors. For a classification task, we now use the most frequent of all values y from our k-neighbors. For regression tasks, where y is numerical, we use the average of all values y from our k-neighbors.
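As a sketch of this scoring step, the distance d(s, xj) and the neighbor lookup can be written compactly with NumPy. The function and array names below are made up for illustration, not a production implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, s, k=5, classification=True):
    """Score a single new point s against the stored training data."""
    # d(s, xj) = sqrt(sum_i (s_i - x_ji)^2), computed for every row j at once.
    distances = np.sqrt(((X_train - s) ** 2).sum(axis=1))
    # Indices of the k smallest distances: the k-neighbors.
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]
    if classification:
        # Most frequent label y among the k-neighbors.
        values, counts = np.unique(neighbor_labels, return_counts=True)
        return values[np.argmax(counts)]
    # Regression: average of the k-neighbors' numeric y values.
    return neighbor_labels.mean()

# n = 4 training points with m = 2 attributes each.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [2.9, 3.0]])
y_train = np.array(["blue", "blue", "orange", "orange"])
print(knn_predict(X_train, y_train, s=np.array([3.0, 3.0]), k=3))  # orange
```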

But what if our attributes are not numerical or consist of numerical and categorical attributes? Then you can use any other distance measure which can handle this type of data. This article discusses some frequent choices.

By the way, k-nearest neighbors models with k=1 are the reason why calculating training errors is completely pointless. Can you see why?

Practical usage of the k-nearest neighbor algorithm

K-nearest neighbors (or KNN) should be a standard tool in your toolbox. It is fast, easy to understand even for non-experts, and easy to tune to different kinds of predictive problems. But there are some things to consider, which we will discuss in the following.

Data preparation

We have seen that the key part of the algorithm is the definition of a distance measure, and a frequent choice is the Euclidean distance. This distance measure treats all data columns in the same way, though. It subtracts the values for each dimension before it sums up the squares of those differences. And that means that columns with a wider data range have a larger influence on the distance than columns with a smaller data range.

So, you should normalize the data set so that all columns are roughly on the same scale. There are two common ways of normalization. First, you could bring all values of a column into a range between 0 and 1. Or you could change the values of each column so that the column has a mean of 0 and a standard deviation of 1 afterwards. We call this type of normalization z-transformation or standard score.

Tip: Whenever you know that the machine learning algorithm is making use of a distance measure, you should normalize the data. Another famous example would be k-means clustering.
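Assuming you are working in Python, both kinds of normalization are available in scikit-learn; the small array below is only an illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two columns on very different scales, e.g. income in dollars and age in years.
X = np.array([[30000.0, 25.0],
              [52000.0, 40.0],
              [110000.0, 61.0]])

# Option 1: bring every column into the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Option 2: z-transformation, i.e. mean 0 and standard deviation 1 per column.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```

In practice you would fit the scaler on the training data only and reuse the fitted scaler to transform any new points before scoring, so that new points are measured on the same scale.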

Parameters to tune

The most important parameter you need to tune is k, the number of neighbors used to make the class decision.  The minimum value is 1 in which case you only look at the closest neighbor for each prediction to make your decision.  In theory, you could use a value for k which is as large as your total training set.  This would make no sense though, since in this case you would always predict the majority class of the complete training set.

Here is a good way to interpret the meaning behind k. Small numbers indicate “local” models, which can be non-linear; the decision boundary between the classes wiggles a lot. As the number grows, the wiggling lessens until you almost end up with a linear decision boundary.

[Figure: knn_impact_of_k — decision boundaries for small and large values of k]

We see a data set in two dimensions on the left.  In general, the top right belongs to the red class and the bottom left to the blue class.  But there are also some local groups inside both areas.  Small values for k lead to more wiggly decision boundaries.  For larger values the decision boundary becomes smoother, almost linear in this case.

Good values for k depend on the data you have and on whether the problem is non-linear.  You should try a couple of values between 1 and about 10% of the size of the training data set.  Then you will see if there is a promising area worth further optimization of k.

The second parameter you might want to consider is the type of distance function you are using.  For numerical values, Euclidean distance is a good choice.  You might want to try Manhattan distance which is sometimes used as well.  For text analytics, cosine distance can be another good alternative worth trying.
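One possible way to run such a search in Python is a simple cross-validated sweep over k and the distance metric with scikit-learn. The data set and parameter ranges below are placeholders, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # placeholder data set with 150 rows

best = None
for metric in ["euclidean", "manhattan"]:
    # Try k from 1 up to roughly 10% of the training set size.
    for k in range(1, 16):
        model = KNeighborsClassifier(n_neighbors=k, metric=metric)
        score = cross_val_score(model, X, y, cv=5).mean()
        if best is None or score > best[0]:
            best = (score, k, metric)

print("best accuracy %.3f with k=%d and %s distance" % best)
```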

Memory usage & runtimes

Please note that all this algorithm is doing is storing the complete training data.  So, the memory needs grow linearly with the number of data points you provide for training.  Smarter implementations of this algorithm might choose to store the data in a more compact fashion.  But in a worst-case scenario you still end up with a lot of memory usage.

For training, the runtime is as good as it gets.  The algorithm is doing no calculations at all besides storing the data which is fast.

The runtime for scoring, though, can be large, which is unusual in the world of machine learning.  All calculations happen during model application.  Hence, the scoring runtime scales linearly with the number of data columns m and the number of training points n.  So, if you need to score fast and the number of training data points is large, then k-nearest neighbors is not a good choice.

K-nearest neighbor in RapidMiner

First and foremost, download RapidMiner Studio here. Then you can download the processes below to build this machine learning model yourself in RapidMiner.

  • Train and apply a KNN model: knn_training_scoring
  • Optimize the value for k: knn_optimize_parameter

Please download the Zip-file and extract its content. The result will be an .rmp file which can be loaded into RapidMiner via “File” -> “Import Process”.

What can we infer about our KNN model when the value of k is too big?

If k is selected to be too large, the model becomes too generalized and fails to accurately predict the data points in both train and test sets. This situation is known as underfitting.

What is the best value for K in KNN?

A commonly cited heuristic is to set K to the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value.
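As a small sketch of this rule of thumb (the sample count below is made up, and the heuristic is only a starting point, not a guarantee):

```python
import math

n_samples = 400                       # N: number of training samples
k = int(round(math.sqrt(n_samples)))  # square-root heuristic -> 20
if k % 2 == 0:
    k += 1                            # an odd k avoids ties in a two-class vote
print(k)                              # 21
```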

Which value of k in k-NN would minimize the leave-one-out cross-validation error?

5-NN will have the lowest leave-one-out cross-validation error.

What is K-Nearest Neighbour?

K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique. The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.