I have more than 1M data points, and 32 of them (orange in the pic) belong to my true class.
I would like to find the blue points that are similar to the orange ones.
Feature vectors are just embeddings.
The approach I took is to build a pseudo 95% confidence region around the orange points and then flag every point that falls inside that region as belonging to my true class.
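For concreteness, here is a minimal sketch of one way such a region could be built. It is an assumption about the approach, not necessarily the exact construction used: it treats the region as an ellipsoid around the positive-class mean, measures squared Mahalanobis distance, and takes the 95% quantile of the positives' own distances as the cutoff. The function name `flag_in_region` and the `ridge` parameter are hypothetical.

```python
import numpy as np

def flag_in_region(positives, candidates, coverage=0.95, ridge=1e-3):
    """Return a boolean mask over `candidates` marking the points that fall
    inside an ellipsoidal region fitted to `positives`."""
    mean = positives.mean(axis=0)
    # Assumption: with only 32 positives the sample covariance is noisy, and
    # singular when the embedding dimension exceeds 31, so a small ridge term
    # is added to keep it invertible.
    cov = np.cov(positives, rowvar=False) + ridge * np.eye(positives.shape[1])
    precision = np.linalg.inv(cov)

    def squared_distance(points):
        centered = points - mean
        # Row-wise squared Mahalanobis distance to the positive-class mean
        return ((centered @ precision) * centered).sum(axis=1)

    # Cutoff: the `coverage` quantile of the positives' own distances
    threshold = np.quantile(squared_distance(positives), coverage)
    return squared_distance(candidates) <= threshold
```

Calling `flag_in_region(orange_embeddings, blue_embeddings)` would return one boolean per blue point. The cutoff is empirical rather than a chi-square quantile, which is why "pseudo" is a fair description.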
I think I cannot use a KNN algorithm for the following reasons:
- I only know beforehand which points belong to the positive class.
- KNN would be highly overfitted, as I only have 32 positive data points out of more than 1M data points.
Is there any other algorithm or approach that better suits this problem?
Clustering very large data sets tends to grind to a halt. Here's a crazy idea: can you take a random sample of the data set and work with that? If the selection is truly random, the sample is just a smaller subset of your full data set, and it should be representative of the whole thing. With pandas, this should be as simple as a call to `DataFrame.sample`.
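A minimal sketch of that sampling step, assuming the embeddings sit in a pandas DataFrame. The column names, the `is_positive` flag, the 1% fraction and the synthetic data are all hypothetical stand-ins. Keeping all 32 known positives alongside the random sample is my own assumption, since a small uniform sample would most likely drop nearly all of them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 8-dimensional embeddings plus a flag
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(100_000, 8)),
    columns=[f"e{i}" for i in range(8)],
)
df["is_positive"] = False
df.loc[df.index[:32], "is_positive"] = True

# Uniform random sample of 1% of the unlabeled rows (seed fixed for repeatability)
unlabeled_sample = df[~df["is_positive"]].sample(frac=0.01, random_state=42)

# Assumption: keep every known positive next to the sampled unlabeled rows
subset = pd.concat([df[df["is_positive"]], unlabeled_sample])
```

Whatever method you settle on can then be prototyped on `subset` before running it on the full data set.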
See this link for more info.
https://towardsdatascience.com/how-to-sample-a-dataframe-in-python-pandas-d18a3187139b