Find 10 closest points in decreasing order

851 Views Asked by At

I am trying to find the distance between a point and other 40,000 points.

Each point is a 300 dimension vector.

I am able to find the closes point. How do I find the 10 nearest points in decreasing order?

Function for closest point:

from scipy.spatial import distance
def closest_node(node,df):
    closest_index = distance.cdist([node],df.feature.tolist()).argmin()
    return pd.Series([df.title.tolist([closest_index],df.id.tolist()[closest_index]])

This command returns the closest title and id:

df3[["closest_title","closest_id"]]=df3.feature.apply(lambda row: closest_node(row,df2))

df2- pandas dataframe of 40,000 points (each 300 dimension)

How do I return the title and index for the 10 closest points

Thanks

2

There are 2 best solutions below

0
On BEST ANSWER

Just slice the sorted distance matrix for the top 10 nodes. Something like this:

from scipy.spatial import distance

# Find the query node
query_node = df.iloc[10] ## Not sure what you're looking for

# Find the distance between this node and everyone else
euclidean_distances = df.apply(lambda row: distance.euclidean(row, query_node), axis=1)

# Create a new dataframe with distances.
distance_frame = pandas.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort("dist", inplace=True)

# nodes
smallest_dist_ixs = distance_frame.iloc[1:10]["idx"]
most_similar_nodes = df.iloc[int(smallest_dist_ixs)]

My assumption based on the word 'title' you have used here, and the choice of 300 dimensional vectors, is that these are word or phrase vectors.
Gensim actually has a manner to get the top N number of similar words based on this idea, which is reasonably fast.

https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html

>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]

For something slightly different, this is also slightly similar to the traveling salesman problem (TSP) if you want to get shortest paths between all points, and then simply slice out the first 10 'cities'.

Google has a pretty simple and quick python implementation with OR-Tools here: https://developers.google.com/optimization/routing/tsp.

0
On

As I don't know your complete code of have a sample of data, here would be my suggestion:

Instead of using ".argmin()" just sort your list by distance and then return the first ten elements of the sorted list. Then find their indices like you're already doing it.