Nearest Neighborhood using a confidence region

66 Views Asked by 3nomis At 28 June 2025 at 16:20

I have more than 1M data points, and 32 of them (orange in the picture) belong to my true class.
I would like to find blue points that are similar to the orange ones.
The feature vectors are just embeddings.

The approach I took is to build a pseudo 95% confidence region around the orange points and then flag every point inside that region as belonging to my true class. I think I cannot use a KNN algorithm for the following reasons:

I only know beforehand which points belong to the positive class.
KNN would be highly overfitted, as I only have 32 positive data points out of more than 1M data points.

Is there any other algorithm or approach that better suits this problem?
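
For reference, here is a rough sketch of what I mean by a pseudo 95% confidence region. It is only an illustration under my own assumptions: the function name `pseudo_confidence_region`, the `ridge` term, and the toy data are all made up for this example. The region is an ellipsoid fitted to the positive points, with the cutoff taken as the empirical 95% quantile of the positives' own squared Mahalanobis distances rather than a theoretical chi-square value.

```python
import numpy as np


def pseudo_confidence_region(X_pos, X_all, coverage=0.95, ridge=1e-3):
    """Return a boolean mask of rows in X_all inside an ellipsoid fitted to X_pos."""
    mu = X_pos.mean(axis=0)
    # With only a few positives the sample covariance can be singular,
    # so a small ridge is added to the diagonal (an assumption of this sketch).
    cov = np.cov(X_pos, rowvar=False) + ridge * np.eye(X_pos.shape[1])
    cov_inv = np.linalg.inv(cov)

    def sq_mahalanobis(X):
        d = X - mu
        return np.sum((d @ cov_inv) * d, axis=1)

    # Empirical cutoff: the distance that covers `coverage` of the positives.
    threshold = np.quantile(sq_mahalanobis(X_pos), coverage)
    return sq_mahalanobis(X_all) <= threshold


# Toy data standing in for the real embeddings (hypothetical shapes and values).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=3.0, scale=0.5, size=(32, 8))
X_all = rng.normal(loc=0.0, scale=2.0, size=(10_000, 8))

inside_all = pseudo_confidence_region(X_pos, X_all)
inside_pos = pseudo_confidence_region(X_pos, X_pos)
far_point = X_pos.mean(axis=0, keepdims=True) + 100.0
inside_far = pseudo_confidence_region(X_pos, far_point)
```

Here `inside_all` is the mask I would use to flag candidate points. If the embedding dimension is larger than the number of positives, the covariance estimate leans heavily on the ridge term, so the shape of the region should be treated with caution.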


There is 1 best solution below

ASH On 06 December 2021 at 04:59

Clustering very large data sets tends to grind to a halt. Here's a crazy idea: can you take a random sample of the data set and work with that? If the selection is truly random, the sample is just a subset of your full data set, and the smaller piece should be representative of the whole thing. It should be as simple as this:

subset = df.sample(frac=0.5)

See this link for more info.

https://towardsdatascience.com/how-to-sample-a-dataframe-in-python-pandas-d18a3187139b
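
One caveat with plain random sampling here: with only 32 positives, a 50% sample would be expected to drop about half of them. A minimal sketch of one way around that, assuming the data lives in a DataFrame with a 0/1 label column (the column name `is_positive` and the toy data are hypothetical), is to sample only the unlabeled rows and then add all positives back:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 4-dimensional embeddings plus a 0/1 label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10_000, 4)), columns=["e0", "e1", "e2", "e3"])
df["is_positive"] = 0
df.loc[df.index[:32], "is_positive"] = 1

positives = df[df["is_positive"] == 1]
unlabeled = df[df["is_positive"] == 0]

# Sample only the unlabeled rows; random_state makes the draw repeatable.
unlabeled_sample = unlabeled.sample(frac=0.5, random_state=42)

# Add every positive back so none of the 32 known points is lost to sampling.
subset = pd.concat([positives, unlabeled_sample])
```

The `random_state` argument is optional; it is only there so that repeated runs give the same subset.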
