Nearest neighborhood using a confidence region


I have more than 1M data points, and 32 of them (orange in the picture) belong to my true class.
I would like to find blue points that are similar to the orange ones.
Feature vectors are just embeddings.
The approach I took is to build a pseudo 95% confidence region and then flag the points within that region as belonging to my true class. I don't think I can use a KNN algorithm, for the following reasons:

  • I only know beforehand which points belong to the positive class.
  • KNN would be highly overfitted, as I only have 32 positive data points out of more than 1M data points.

Is there any other algorithm or approach that better suits this problem?
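
For reference, the pseudo 95% confidence region described above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the original post: the region is taken to be a Mahalanobis-distance ellipsoid around the positive points, the cutoff is the empirical 95th percentile of the positives' own distances, and the covariance is shrunk toward a scaled identity so it stays invertible with only 32 points. The function name `mahalanobis_region`, the `shrinkage` value, and the synthetic data are all hypothetical.

```python
import numpy as np

def mahalanobis_region(positives, candidates, quantile=0.95, shrinkage=0.1):
    """Flag candidates that fall inside a pseudo confidence region
    (a Mahalanobis ellipsoid) fitted on the positive points."""
    mean = positives.mean(axis=0)
    cov = np.cov(positives, rowvar=False)
    # Assumption: shrink toward a scaled identity so the covariance stays
    # invertible when there are few positives relative to the dimension.
    dim = positives.shape[1]
    target = np.eye(dim) * np.trace(cov) / dim
    cov = (1 - shrinkage) * cov + shrinkage * target
    inv_cov = np.linalg.inv(cov)

    def sq_dist(x):
        # Squared Mahalanobis distance of every row of x to the mean.
        centered = x - mean
        return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

    # Empirical cutoff: the distance that covers `quantile` of the positives.
    threshold = np.quantile(sq_dist(positives), quantile)
    return sq_dist(candidates) <= threshold

# Synthetic stand-ins for the real embeddings (hypothetical data).
rng = np.random.default_rng(0)
positives = rng.normal(loc=3.0, scale=0.5, size=(32, 8))
background = rng.normal(loc=0.0, scale=1.0, size=(10_000, 8))

flags = mahalanobis_region(positives, background)
print(flags.sum(), "of", len(background), "points flagged")
```

The distance computation is vectorized, so applying it to 1M candidate rows is a single pass over the data rather than a pairwise search.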

There is 1 solution below


Clustering very large data sets tends to grind to a halt. Here's a crazy idea: can you take a random sample of the data set and work with that? If the selection is truly random, the sample is just a smaller subset of the full data set and should be reasonably representative of it. It could be as simple as this.

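# df is your existing pandas DataFrame; random_state makes the 50% draw reproducible
subset = df.sample(frac=0.5, random_state=42)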

See this link for more info.

https://towardsdatascience.com/how-to-sample-a-dataframe-in-python-pandas-d18a3187139b
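
One caveat: with only 32 positive points, a plain random sample would be expected to drop about half of them. A minimal sketch of one way around that is to sample only the unlabeled rows and keep every positive. The `is_positive` column, the embedding column names, and the synthetic frame below are assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row per point, plus an "is_positive" flag column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10_000, 4)), columns=["e0", "e1", "e2", "e3"])
df["is_positive"] = False
df.loc[df.index[:32], "is_positive"] = True

# Keep every known positive and sample only the unlabeled rows.
positives = df[df["is_positive"]]
unlabeled = df[~df["is_positive"]].sample(frac=0.5, random_state=42)
subset = pd.concat([positives, unlabeled])

print(len(positives), len(unlabeled), len(subset))
```

The sampled rows keep their original index, so any flags computed on `subset` can be mapped back to the full data set afterwards.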