KNN classifier algorithm not working for all cases

2.2k Views Asked by At

To effectively find n nearest neighbors of a point in d-dimensional space, I selected the dimension with greatest scatter (i.e. in this coordinate differences between points are largest). The whole range from minimal to maximal value in this dimension was split into k bins. Each bin contains points which coordinates (in this dimensions) are within the range of that bin. It was ensured that there are at least 2n points in each bin. The algorithm for finding n nearest neighbors of point x is following:

  1. Identify bin kx,in which point x lies(its projection to be precise).
  2. Compute distances between x and all the points in bin kx.
  3. Sort computed distances in ascending order.
  4. Select first n distances. Points to which these distances were measured are returned as n nearest neighbors of x.

This algorithm is not working for all cases. When algorithm can fail to compute nearest neighbors? Can anyone propose modification of the algorithm to ensure proper operation for all cases?

3

There are 3 best solutions below

1
On

This is failing because (any of) the k nearest neighbors of x could be in a different bin than x.

0
On

What do you mean by "not working"? You do understand that, what you are doing is only an approximate method.
Try normalising the data and then choosing the dimension, else scatter makes no sense.

The best vector for discrimination or for clustering may not be one of the original dimensions, but any combination of dimensions.
Use PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis), to identify a discriminative dimension.

0
On

Where KNN failure:

  1. If the data is a jumble of all different classes then knn will fail because it will try to find k nearest neighbours but all points are random

  2. outliers points

Let's say you have two clusters of different classes. Then if you have a outlier point as query, knn will assign one of the classes even though the query point is far away from both clusters.