I am trying to find the distance between a point and other 40,000 points.
Each point is a 300 dimension vector.
I am able to find the closes point. How do I find the 10 nearest points in decreasing order?
Function for closest point:
from scipy.spatial import distance
def closest_node(node,df):
closest_index = distance.cdist([node],df.feature.tolist()).argmin()
return pd.Series([df.title.tolist([closest_index],df.id.tolist()[closest_index]])
This command returns the closest title and id:
df3[["closest_title","closest_id"]]=df3.feature.apply(lambda row: closest_node(row,df2))
df2- pandas dataframe of 40,000 points (each 300 dimension)
How do I return the title and index for the 10 closest points
Thanks
Just slice the sorted distance matrix for the top 10 nodes. Something like this:
My assumption based on the word 'title' you have used here, and the choice of 300 dimensional vectors, is that these are word or phrase vectors.
Gensim actually has a manner to get the top N number of similar words based on this idea, which is reasonably fast.
https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html
For something slightly different, this is also slightly similar to the traveling salesman problem (TSP) if you want to get shortest paths between all points, and then simply slice out the first 10 'cities'.
Google has a pretty simple and quick python implementation with OR-Tools here: https://developers.google.com/optimization/routing/tsp.