Machine Learning: Introduction to K-Nearest Neighbors (KNN) in Python

In machine learning, many problems are classification problems rather than regression problems. For example: who will win the election? Will a coin flip come up heads or tails? Which of the six numbers will a die roll show? Here, we are going to discuss one of the widely used classification and regression algorithms, K-nearest neighbors (KNN).
Why do we need KNN?
So far we have seen that a machine learning algorithm makes predictions by learning from past data. KNN, however, works on feature similarity: the more a data point's features resemble those of a category, the more likely it is to fall into that category. For example, to predict whether a given animal is a horse or a dog, the animal is assigned to whichever category (horse or dog) its features match most closely.
What is KNN algorithm?
K-nearest neighbors is one of the simplest supervised machine learning algorithms. KNN classifies a data point based on how its neighbors are classified. It is also known as an instance-based learning or feature-similarity algorithm, and it is a lazy learning algorithm, since you do not need to explicitly build a model in advance. K in KNN is a parameter that refers to the number of nearest neighbors. For example, if k is 5, a new data point is classified by the majority class among its 5 nearest neighbors. The KNN algorithm can be used for both classification and regression, but it is most widely used for classification problems.
How does KNN algorithm work?
Let's take an example. We have a small dataset containing the height and weight of some persons. Based on their height and weight, they are classified as underweight or normal. The table below shows the data.
Height (CM) | Weight (KG) | Class |
---|---|---|
167 | 51 | Underweight |
182 | 62 | Normal |
176 | 69 | Normal |
173 | 64 | Normal |
172 | 65 | Normal |
174 | 56 | Underweight |
169 | 58 | Normal |
173 | 57 | Normal |
170 | 55 | Normal |
170 | 57 | ? |
Now let's find out whether a person with HEIGHT = 170 CM and WEIGHT = 57 KG is underweight or normal. Let's assume that we don't know how to calculate the BMI (Body Mass Index). To find the nearest neighbors, we calculate the Euclidean distance. According to the Euclidean distance formula, the distance between two points x and y with two features is given by: $$ dist(d) = { \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2} } $$
Here x and y are points having 2 features. The Euclidean distance for multiple features can be written as follows:
$$ dist(d) = { \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 +...+ (x_i-y_i)^2 +...+ (x_n-y_n)^2}} $$
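As a quick sanity check, here is a minimal Python sketch of this formula (the function name euclidean_distance is our own, not part of any library):
import math

def euclidean_distance(x, y):
    # x and y are sequences of feature values of equal length
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# distance between the first known person (167 cm, 51 kg)
# and the unknown person (170 cm, 57 kg)
print(euclidean_distance((167, 51), (170, 57)))  # about 6.7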
We calculate the Euclidean distance between the unknown data point and all the known data points in the dataset.
In our example, the unknown data point is the last row of the table (170 cm, 57 kg). The table below shows the Euclidean distance between each known point and this unknown point.
Height (CM) | Weight (KG) | Class | Euclidean Distance |
---|---|---|---|
167 | 51 | Underweight | 6.7 |
182 | 62 | Normal | 13 |
176 | 69 | Normal | 13.4 |
173 | 64 | Normal | 7.6 |
172 | 65 | Normal | 8.2 |
174 | 56 | Underweight | 4.1 |
169 | 58 | Normal | 1.4 |
173 | 57 | Normal | 3 |
170 | 55 | Normal | 2 |
For the above example, if we choose k = 3, the 3 points closest to the unknown point are given in the following table.
Height (CM) | Weight (KG) | Class | Euclidean Distance |
---|---|---|---|
169 | 58 | Normal | 1.4 |
173 | 57 | Normal | 3 |
170 | 55 | Normal | 2 |
According to the above table, all of the closest neighbors belong to the class Normal, so our new point with HEIGHT = 170 CM and WEIGHT = 57 KG falls in the category Normal.
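To make the whole procedure concrete, here is a small sketch in plain Python that reproduces the steps above on the toy dataset: it computes all the distances, picks the k = 3 nearest neighbors and takes a majority vote (all variable names here are our own):
import math
from collections import Counter

# known persons: (height, weight, class)
data = [
    (167, 51, "Underweight"), (182, 62, "Normal"), (176, 69, "Normal"),
    (173, 64, "Normal"), (172, 65, "Normal"), (174, 56, "Underweight"),
    (169, 58, "Normal"), (173, 57, "Normal"), (170, 55, "Normal"),
]
new_point = (170, 57)

# Euclidean distance from every known person to the new point
distances = [
    (math.sqrt((h - new_point[0]) ** 2 + (w - new_point[1]) ** 2), label)
    for h, w, label in data
]

# pick the k = 3 nearest neighbors and take the majority class
k = 3
nearest = sorted(distances)[:k]
print(nearest)  # the three smallest distances: about 1.41, 2.0 and 3.0, all 'Normal'
print(Counter(label for _, label in nearest).most_common(1)[0][0])  # Normal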
How do we choose the factor K?
KNN is based on feature similarity, and choosing the right value of K, a process called parameter tuning, is important for good accuracy. But how do we determine which points are close to each other and how similar they are? Let's understand this with the example figure given below:

In the image given above there are some pentagons in red and squares in green. A new data point is given, which needs to be classified either as a pentagon or as a square. From the image, we can see that if we choose K=3, the 3 nearest neighbors include two pentagons and one square; since pentagons form the majority (2 out of 3), the point is classified as a pentagon. If we choose K=7, the majority of the 7 nearest neighbors are squares, so the new point is classified as a square.
So which value of K should we choose? If K is too small, noise and outliers have a higher influence on the result. On the other hand, if K is too large, the computation becomes more expensive and the result can also become inaccurate, since neighbors from other classes get included.
For choosing the best K, we can use the confusion matrix or the accuracy score: we apply the algorithm for multiple values of K on the same dataset, and the value for which KNN gives the best accuracy is selected as the best K.
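As a hedged sketch of this tuning step (using a small synthetic dataset from scikit-learn rather than real data, so the exact scores are only illustrative), we can score each candidate value of K with k-fold cross-validation and pick the one with the highest mean accuracy:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# a small synthetic classification dataset, just for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# mean 5-fold cross-validation accuracy for each candidate value of K
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X, y, cv=5)
    print("k =", k, "mean accuracy =", scores.mean())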
When do we use KNN?
The K-nearest neighbors method has been successful in classification as well as regression. As it is a non-parametric algorithm, it is effective for a wide range of classification problems. It can also be used for recognizing handwritten text, classifying satellite images, etc.
An advantage of KNN is that new observations can be added easily without retraining a model. A disadvantage is that it is sensitive to noisy and irrelevant features, and it has to compute the distance to every point in order to make a prediction, which is computationally intensive.
KNN using Python
We are going to use the Iris dataset for classifying iris plants into three species (Iris-setosa, Iris-versicolor, Iris-virginica) in Python using the KNN algorithm.
The sklearn library provides the Iris dataset directly, without downloading it manually. Let's see it with an example.
# load necessary libraries
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# load Iris dataset
iris = datasets.load_iris()
# get X (features) and y (target) from whole dataset
X = iris.data
y = iris.target
# split them in train and test parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
# standardize features by removing the mean and scaling to unit variance
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.transform(X_test)
# Apply knn algorithm with multiple values of k on Iris dataset and find the best k
for k in range(2, 30):
    classifier = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    classifier.fit(X_train, y_train)
    y_preds = classifier.predict(X_test)
    print("Accuracy for k =", k, "is:", accuracy_score(y_test, y_preds))
Output:
[Output listing: accuracy score for each value of k ("Choose best K")]
From the above output we can see that the accuracy is about 0.97 for k = 6, 10, 11 and 12. For k = 10, 11 and 12 this high accuracy is reached consistently, so we choose k = 11 in our case.
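As a final step, here is a short sketch of how the chosen value could be used: we retrain the classifier with k = 11 on the standardized training data and predict the species of a new flower (the measurements below are made up purely for illustration):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# rebuild the same training setup as above
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=5)
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)

# final model with the chosen k = 11
final_clf = KNeighborsClassifier(n_neighbors=11, metric='euclidean')
final_clf.fit(X_train, y_train)

# classify a made-up flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = sc_x.transform([[5.8, 3.0, 4.2, 1.3]])
print(iris.target_names[final_clf.predict(new_flower)[0]])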