K-Nearest Neighbour

K-nearest neighbors (KNN) is a supervised learning algorithm used for both classification and regression. For a given test point, KNN calculates the distance between the test point and every training point, then selects the K training points closest to it. For classification, the algorithm looks at the classes of those K neighbors and assigns the test point to the class that appears most often (equivalently, the class with the highest estimated probability). For regression, the predicted value is the mean of the target values of the K selected training points.
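
As a minimal sketch of both uses, here is a hypothetical example with scikit-learn's KNN classifier and regressor; the data, labels, and target values are made up purely for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up training data: a 1-D feature, a class label, and a numeric target.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array(["low", "low", "low", "high", "high", "high"])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

x_new = np.array([[2.5]])  # the test point

# Classification: majority class among the K nearest training points.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict(x_new))   # ['low']

# Regression: mean target value of the K nearest training points.
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
print(reg.predict(x_new))   # [2.0] -> mean of 1.0, 2.0, 3.0
```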

Let's look at the example below to understand this better.

Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. We can use the KNN algorithm for this identification, because it works on a similarity measure. Our KNN model will compare the features of the new image with those of known cat and dog images and, based on the most similar features, place it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories does the data point belong to? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a new data point. Consider the diagram below:

How does K-NN work?

The working of K-NN can be explained with the following steps:

Step-1: Select the number K of neighbors.

Step-2: Calculate the Euclidean distance from the new data point to every training point.

Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.

Step-4: Among these K neighbors, count the number of data points in each category.

Step-5: Assign the new data point to the category with the largest number of neighbors.

Step-6: Our model is ready.
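
Here is a minimal from-scratch sketch of these steps in Python; the helper name knn_predict and the toy training points are purely illustrative:

```python
import math
from collections import Counter

def knn_predict(training_points, new_point, k=5):
    """Classify new_point by a majority vote of its k nearest neighbors.

    training_points is a list of (features, category) pairs.
    """
    # Step 2: Euclidean distance from the new point to every training point.
    distances = []
    for features, category in training_points:
        d = math.sqrt(sum((f - n) ** 2 for f, n in zip(features, new_point)))
        distances.append((d, category))

    # Step 3: take the K nearest neighbors.
    nearest = sorted(distances)[:k]

    # Steps 4-5: count categories among the neighbors, return the majority.
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]

# Toy data mirroring the worked example below: with k=5 the vote is
# 3 neighbors from category "A" against 2 from category "B".
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((5, 5), "B"), ((5, 6), "B"), ((9, 9), "B")]
print(knn_predict(training, (2, 2), k=5))  # "A"
```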

Suppose we have a new data point that we need to assign to one of the categories. Consider the image below:


First, we choose the number of neighbors; here we choose K = 5.

Next, we calculate the Euclidean distance between the new data point and each training point. The Euclidean distance is the straight-line distance between two points, which we have already studied in geometry. For two points x = (x1, ..., xn) and y = (y1, ..., yn) it is calculated as d(x, y) = √((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²).
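
As a quick worked example with two arbitrarily chosen 2-D points, say (1, 2) and (4, 6):

```latex
d = \sqrt{(4 - 1)^2 + (6 - 2)^2} = \sqrt{9 + 16} = \sqrt{25} = 5
```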

By calculating the Euclidean distances we find the nearest neighbors: three of them are in category A and two are in category B. Consider the image below:


As we can see, three of the five nearest neighbors are from category A, hence the new data point is assigned to category A.

How to choose a K value?

The K value indicates the number of nearest neighbors to consider. We have to compute the distances between the test point and all of the labeled training points, and this computation is repeated for every query at prediction time, which is why KNN is called a lazy learning algorithm and why prediction can be computationally expensive.

As you can verify from the image above, if we proceed with K = 3 we predict that the test input belongs to class B, and if we continue with K = 7 we predict that it belongs to class A.

This illustrates how strongly the K value can affect KNN's predictions.
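
A minimal sketch of this effect, assuming a small made-up 1-D dataset arranged so the prediction flips between K = 3 and K = 7, as described above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D training data: three class-B points close to the
# test input and four class-A points a little farther away.
X_train = np.array([[1.0], [2.0], [3.0],          # class B
                    [4.0], [5.0], [6.0], [7.0]])  # class A
y_train = np.array(["B", "B", "B", "A", "A", "A", "A"])

x_test = np.array([[0.0]])  # the new point we want to classify

for k in (3, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k}: predicted class = {model.predict(x_test)[0]}")

# K=3 -> the 3 nearest neighbors are all B, so the prediction is B.
# K=7 -> 4 of the 7 neighbors are A, so the majority vote flips to A.
```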

How to select the optimal K value?

There are no pre-defined statistical methods to find the most favorable value of K.

  • Initialize a random K value and start computing.
  • Choosing a small value of K leads to unstable decision boundaries.
  • A larger K value is better for classification, as it smooths the decision boundaries.
  • Plot the error rate against K over a range of values, then choose the K with the minimum error rate (see the sketch after this list).
  • Implementing the model this way will give you a feel for choosing the optimal K value.
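
One possible sketch of this procedure with scikit-learn, assuming the Iris dataset and 5-fold cross-validation (any labeled dataset would do):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

k_values = range(1, 31)
error_rates = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy; error rate = 1 - accuracy.
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    error_rates.append(1 - accuracy)

best_k = k_values[int(np.argmin(error_rates))]
print("K with the lowest cross-validated error:", best_k)

# Plot the error rate against K and look for the minimum.
plt.plot(list(k_values), error_rates, marker="o")
plt.xlabel("K")
plt.ylabel("Error rate")
plt.title("Error rate vs. K")
plt.show()
```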

Calculating distance:

The first step is to calculate the distance between the new point and each training point. There are various methods for calculating this distance; the most commonly used are the Euclidean and Manhattan distances (for continuous variables) and the Hamming distance (for categorical variables).

Euclidean Distance: Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).

Manhattan Distance: This is the distance between real-valued vectors, calculated as the sum of the absolute differences of their components.

Hamming Distance: It is used for categorical variables. If the value x and the value y are the same, the distance D is 0; otherwise D = 1.
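
A minimal sketch of the three distance functions in plain Python; the two vectors are assumed to have the same length:

```python
import math

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    # Sum of the absolute coordinate differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def hamming_distance(x, y):
    # Number of positions where the categorical values differ.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

print(euclidean_distance([1, 2], [4, 6]))            # 5.0
print(manhattan_distance([1, 2], [4, 6]))            # 7
print(hamming_distance(["red", "S"], ["red", "M"]))  # 1
```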

Pros of Using KNN

1) KNN is a perfect first step for machine learning beginners as it is easy to explain, simple to understand, and surprisingly powerful. It yields highly competitive results despite its simplicity. A good application of this is the use of KNN in collaborative filtering algorithms for recommender systems, such as those behind the scenes of Amazon's recommendations.
2) KNN is a non-parametric algorithm and does not require any assumptions about the data distribution. This gives KNN an edge in settings where the data is highly unusual, and it is why KNN is often the first choice when there is little or no prior knowledge about the data distribution.
3) It is a versatile supervised machine learning algorithm that can be used for both regression and classification problems, and also for similarity search.
4) The algorithm does not have an explicit training step, as it is an instance-based learner. "Training" only involves storing the feature vectors and class labels of the training samples, so KNN can be a good choice for a quick first analysis of a dataset on which you plan to run more complex algorithms.
5) Most classification algorithms are hardcoded for the binary setting by default; using them for multi-class problems requires extending or transforming the problem. KNN lends itself easily to multiclass datasets.
6) There are flexible distance criteria to choose from when building a KNN model: Euclidean, Manhattan, and Hamming distance. Each distance function suits a different type of data, so you can choose the best option based on the nature of the features: Manhattan or Euclidean for numeric features, and Hamming for categorical features (a short sketch follows this list).
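
As a small illustration of point 6, here is one way the distance metric can be switched in scikit-learn's KNeighborsClassifier; the dataset and split are arbitrary choices made for this sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same model, different distance criteria.
for metric in ("euclidean", "manhattan"):
    model = KNeighborsClassifier(n_neighbors=5, metric=metric)
    model.fit(X_train, y_train)
    print(metric, "accuracy:", model.score(X_test, y_test))

# For integer-encoded categorical features, metric="hamming" is also
# accepted by scikit-learn's KNeighborsClassifier.
```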

Cons of Using KNN

1) KNN does not have a training phase, but this comes at the cost of making the prediction step relatively expensive: every time a prediction is made, the algorithm searches for the nearest neighbors in the complete training set. This can be sped up somewhat with data structures such as KD-trees and ball trees (see the sketch after this list).
2) The efficiency of the algorithm declines quickly as the dataset grows.
3) It cannot handle missing values; a complete feature vector is needed for each instance to compute distances. One way to deal with this is to fill missing values with the average value of that feature across the dataset.
4) It suffers from skewed class distributions: if a specific class occurs frequently in the training set, it tends to dominate the majority vote for new examples.
5) The accuracy of KNN deteriorates on high-dimensional data, where there is hardly any difference between the nearest and the farthest neighbor. This curse of dimensionality also degrades the nearest-neighbor search structures (such as KD-trees) that are commonly used to speed KNN up.
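
As a sketch of two of these mitigations with scikit-learn, the snippet below mean-imputes missing values and switches the neighbor search to a KD-tree; the tiny dataset and its values are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset with one missing value (np.nan).
X = np.array([[1.0, 2.0], [2.0, np.nan], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])

# Fill missing values with the feature mean, then use a KD-tree for
# the neighbor search instead of a brute-force scan.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree"),
)
model.fit(X, y)
print(model.predict([[8.5, 8.5]]))  # expected: class 1
```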

A Few Applications of the KNN Algorithm

1) The biggest application of KNN is recommender systems: recommending ads to display to a user (YouTube), recommending products (Amazon), or recommending media to consume. For example, if you buy a smartphone from Amazon, it recommends a mobile cover or earphones to go with it.
2) KNN is used in the retail industry to identify patterns in credit card usage. Many transaction-scrutinizing software applications use KNN to analyze register data and detect unusual or suspicious activity. For instance, if the register data of a retail store shows that a lot of information is being entered manually rather than by scanning or swiping the card, this can be an indication that an employee is stealing customer information.
3) KNN also finds application in politics, for classifying a potential voter as "will vote" or "will not vote".
4) Other applications of KNN include video recognition, image recognition, and handwriting detection.

***Thank You***
