- I have a Pandas DataFrame of 2 million entries
- Each entry is a point in a 100-dimensional space
- I want to compute the Euclidean distance between the last N points and all the others to find the closest neighbors (to simplify, let's say find the single closest neighbor for each of the last 5 points)
- I wrote the code below for a small dataset, but it's fairly slow, and I'm looking for ideas for improvement (especially speed improvement!)
The logic is the following:
- Split the dataframe into target (the points for which we want to find the closest neighbor) and compare (all the others, among which we will look for the neighbor)
- Iterate through the targets
- Compute the squared Euclidean distance between each df_compare point and the target
- Select the top#1 row of the compare df (smallest distance) and save its Name in the target dataframe
import pandas as pd
import numpy as np
data = {
    'Name': ['Ly','Gr','Er','Ca','Cy','Sc','Cr','Cn','Le','Cs','An','Ta','Sa','Ly','Az','Sx','Ud','Lr','Si','Au','Co','Ck','Mj','wa'],
    'dim0': [33,-9,18,-50,39,-23,-19,89,-74,81,8,23,-63,-62,-14,45,39,-46,74,19,7,97,-29,71],
    'dim1': [-7,75,77,-93,-89,4,-96,-64,41,-27,-87,23,-69,-77,-92,18,21,27,-76,-57,-44,20,15,-76],
    'dim2': [-31,54,-14,-93,72,-14,65,44,-88,19,48,-51,-25,36,-46,98,8,0,53,-47,-29,95,65,-3],
    'dim3': [-12,-86,10,93,-79,-55,-6,-79,-12,66,-81,-14,44,84,9,-19,-69,29,-50,-59,35,-28,90,-73],
}
df = pd.DataFrame(data)
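# Work on a copy so that adding a column does not trigger SettingWithCopyWarning
df_target = df.tail(5).copy()
# Object column, since it will hold the neighbour's Name (a string)
df_target['closest_neighbour'] = None
df_compare = df.drop(df_target.index)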
for i, target_row in df_target.iterrows():
    df_compare['distance'] = 0
    for dim in df_target.columns:
        if dim.startswith('dim'):
            df_compare['distance'] = df_compare['distance'] + (target_row[dim] - df_compare[dim])**2
    df_compare.sort_values(by=['distance'], ascending=True, inplace=True)
    closest_neighbor = df_compare.head(1)
    df_target.loc[df_target.index == i, 'closest_neighbour'] = closest_neighbor['Name'].iloc[0]
print(df_target)
Any suggestions for improving the logic or the code are welcome! Cheers
Your dataframe stores the columns as numpy arrays. But for your calculations, it would probably be more efficient to have the dimn rows as numpy arrays, because the distance calculation could then leverage numpy array operations.
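As a rough sketch of that idea (not a tuned implementation), the dim columns can be pulled into a single 2-D numpy array where each row is a point, and all squared distances computed in one go with the identity ||t - c||^2 = ||t||^2 + ||c||^2 - 2 t.c. The function name closest_neighbours and its parameters are made up for this example, and it assumes the dimension columns all share the 'dim' prefix, as in the question.

```python
import numpy as np
import pandas as pd


def closest_neighbours(df, n_targets=5, name_col="Name", dim_prefix="dim"):
    """Hypothetical helper: for each of the last n_targets rows, find the
    Name of the closest row among all the other (earlier) rows."""
    dim_cols = [c for c in df.columns if c.startswith(dim_prefix)]

    # One (n_rows, n_dims) float array; each row is a point.
    points = df[dim_cols].to_numpy(dtype=np.float64)
    targets, compare = points[-n_targets:], points[:-n_targets]

    # Squared Euclidean distances, shape (n_targets, n_compare), using
    # ||t - c||^2 = ||t||^2 + ||c||^2 - 2 t.c (no Python-level loops).
    sq_dist = (
        (targets ** 2).sum(axis=1)[:, None]
        + (compare ** 2).sum(axis=1)[None, :]
        - 2.0 * targets @ compare.T
    )

    # Position (within compare) of the smallest distance for each target.
    nearest = sq_dist.argmin(axis=1)

    result = df.tail(n_targets).copy()
    result["closest_neighbour"] = df[name_col].to_numpy()[:-n_targets][nearest]
    return result
```

With the dataframe from the question, print(closest_neighbours(df)) should give the same kind of table as the original loop. argmin replaces the full sort, since only the single closest point is needed. For 2 million rows of 100 dimensions, the points array alone is about 1.6 GB in float64, so passing dtype=np.float32 or processing the compare rows in chunks may be worth trying if memory is tight.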