K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm generally used for classification but can also be used for regression tasks. It works by finding the "k" closest data points (neighbors) to a given input and makes a prediction based on the majority class (for classification) or the average value (for regression). Since KNN makes no assumptions about the underlying data distribution, it is a non-parametric and instance-based learning method.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately. Instead, it stores the entire dataset and performs computations only at the time of classification.
For example, consider a set of data points with two features, shown in the visualization below:
KNN Algorithm Working Visualization
The image shows how KNN predicts the category of a new data point from its closest neighbors: the new point is classified as Category 2 because most of its nearest neighbors are blue squares, and KNN assigns the category based on the majority of nearby points.
- The red diamonds represent Category 1 and the blue squares represent Category 2.
- The new data point checks its closest neighbors (circled points).
- Since the majority of its closest neighbors are blue squares (Category 2), KNN predicts the new data point belongs to Category 2.
- KNN works by using proximity and majority voting to make predictions.
What is 'K' in K-Nearest Neighbors?
In the k-Nearest Neighbors algorithm, k is simply a number that tells the algorithm how many nearby points or neighbors to look at when it makes a decision.
Example: Imagine you're deciding what type of fruit a new fruit is based on its shape and size. You compare it to fruits you already know.
- If k = 3, the algorithm looks at the 3 closest fruits to the new one.
- If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple because most of its neighbors are apples.
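A minimal sketch of this majority vote, assuming the labels of the k nearest neighbors have already been collected in a list (the example values are purely illustrative):

```python
from collections import Counter

# Labels of the 3 closest fruits to the new one (hypothetical example)
neighbor_labels = ["apple", "apple", "banana"]

# Majority vote: the most common label among the neighbors wins
predicted_label = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted_label)  # -> "apple"
```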
How to Choose the Value of k for the KNN Algorithm
The value of k in KNN decides how many neighbors the algorithm looks at when making a prediction. Choosing the right k is important for good results.
- If the data has a lot of noise or outliers, using a larger k can make the predictions more stable.
- However, if k is too large, the model may become too simple and miss important patterns, which is called underfitting.
- So, k should be chosen carefully based on the data.
Statistical Methods for Selecting k
- Cross-Validation: A good way to find the best value of k is k-fold cross-validation. This means dividing the dataset into several equal parts, or folds (this number of folds is separate from the k in KNN). The model is trained on all but one fold and tested on the remaining one, and the process is repeated so that each fold serves as the test set once. The k value that gives the highest average accuracy across these tests is usually the best one to use (see the sketch after this list).
- Elbow Method: In this method, we draw a graph showing the error rate or accuracy for different k values. As k increases, the error usually drops at first. But after a certain point, the error stops decreasing quickly. The point where the curve changes direction and looks like an "elbow" is usually the best choice for k.
- Odd Values for k: It's a good idea to use an odd number for k, especially in classification problems. This helps avoid ties when deciding which class is the most common among the neighbors.
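As a rough sketch of selecting k with cross-validation, assuming scikit-learn is available and using its built-in iris dataset purely as stand-in data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k and keep the one with the best mean accuracy
best_k, best_score = None, 0.0
for k in range(1, 22, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

print(best_k, round(best_score, 3))
```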
Distance Metrics Used in the KNN Algorithm
KNN relies on a distance metric to identify the nearest neighbors, which are then used for classification or regression. Commonly used distance metrics include:
- Euclidean Distance: Euclidean distance is the straight-line distance between two points in a plane or space. You can think of it as the shortest path you would walk if you were to go directly from one point to another.
- Manhattan Distance: This is the total distance you would travel if you could only move along horizontal and vertical lines, like a grid or city streets. It's also called "taxicab distance" because a taxi can only drive along the grid-like streets of a city.
- Minkowski Distance: Minkowski distance is a family of distances that includes both Euclidean and Manhattan distances as special cases. It is defined as D(x, y) = (∑ |x_i − y_i|^p)^(1/p). When p = 2, it becomes the Euclidean distance formula, and when p = 1, it turns into the Manhattan distance formula. Minkowski distance is essentially a flexible formula that can represent either Euclidean or Manhattan distance depending on the value of p.
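A small sketch of these three metrics with NumPy (the helper names are illustrative, not from any particular library):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2))

def manhattan(a, b):
    # Grid ("taxicab") distance: sum of absolute differences
    return np.sum(np.abs(np.array(a) - np.array(b)))

def minkowski(a, b, p=2):
    # General form: p = 2 gives Euclidean, p = 1 gives Manhattan
    return np.sum(np.abs(np.array(a) - np.array(b)) ** p) ** (1 / p)

print(euclidean([0, 0], [3, 4]))       # 5.0
print(manhattan([0, 0], [3, 4]))       # 7
print(minkowski([0, 0], [3, 4], p=1))  # 7.0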
How the KNN Algorithm Works
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.
Step 1: Selecting the Optimal Value of k
k represents the number of nearest neighbors that need to be considered when making a prediction.
Step 2: Calculating Distance
To measure the similarity between the target point and the training data points, Euclidean distance is widely used. The distance is calculated between each data point in the dataset and the target point.
Step 3: Finding the Nearest Neighbors
The k data points with the smallest distances to the target point are selected as the nearest neighbors.
Step 4: Voting for Classification or Taking the Average for Regression
- When you want to classify a data point into a category, like spam or not spam, the KNN algorithm looks at the K closest points in the dataset. These closest points are called neighbors. The algorithm then looks at which category the neighbors belong to and picks the one that appears the most. This is called majority voting.
- In regression, the algorithm still looks for the K closest points. But instead of voting for a class, it takes the average of the values of those K neighbors. This average is the predicted value for the new point.
This illustrates how a test point is classified based on its nearest neighbors: as the test point moves, the algorithm identifies the closest k data points (in this case, 5) and assigns the test point the majority class label among them, which is the grey class in the visualization.
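Before building KNN from scratch in the next part, here is a minimal sketch of the same workflow with scikit-learn's KNeighborsClassifier, assuming scikit-learn is installed and using its built-in iris dataset purely as stand-in data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: choose k; steps 2-4 (distances, neighbors, voting) happen inside fit/predict
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))  # mean accuracy on the held-out split
```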
Implementing KNN from Scratch in Python
Importing Libraries
Counter is used to count the occurrences of elements in a list or iterable. In KNN, after finding the labels of the k nearest neighbors, Counter helps count how many times each label appears.
import numpy as np
from collections import Counter
Defining the Euclidean Distance Function
euclidean_distance is used to calculate the Euclidean distance between points.
def euclidean_distance(point1, point2):
    # Straight-line (Euclidean) distance between two equal-length points
    return np.sqrt(np.sum((np.array(point1) - np.array(point2)) ** 2))
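The walkthrough above stops at the distance function. Below is a minimal sketch of how the remaining steps (finding the nearest neighbors and voting) could be assembled, reusing numpy, Counter, and euclidean_distance from above; knn_predict and the tiny dataset are illustrative, not from the original text:

```python
def knn_predict(training_data, training_labels, test_point, k=3):
    # Compute the distance from the test point to every training point
    distances = [euclidean_distance(test_point, p) for p in training_data]

    # Indices of the k smallest distances -> the k nearest neighbors
    nearest = np.argsort(distances)[:k]

    # Majority vote among the neighbors' labels
    votes = Counter(training_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset: two clusters in 2D
training_data = [[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]]
training_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(training_data, training_labels, [2, 1], k=3))  # "A"
print(knn_predict(training_data, training_labels, [8, 8], k=3))  # "B"
```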