Optimize scipy nearest neighbor search

651 Views Asked by At

I am trying to find all the nearest neighbors which are within 1 KM radius. Here is my script to construct tree and search the nearest points,

from pysal.cg.kdtree import KDTree

def construct_tree(s):
    data_geopoints = [tuple(x) for x in s[['longitude','latitude']].to_records(index=False)]
    tree = KDTree(data_geopoints, distance_metric='Arc', radius=pysal.cg.RADIUS_EARTH_KM)
    return tree

def get_neighbors(s,tree):
    indices = tree.query_ball_point(s, 1)
    return indices

#Constructing the tree for search
tree = construct_tree(data)

#Finding the nearest neighbours within 1KM
data['neighborhood'] = data['lat_long'].apply(lambda row: get_neighbors(row,tree))

From what I read in pysal page, it says -

kd-tree built on top of kd-tree functionality in scipy. If using scipy 0.12 or greater uses the scipy.spatial.cKDTree, otherwise uses scipy.spatial.KDTree.

In my case it should be using cKDTree. This is working fine for a sample dataset, but since the tree.query_ball_point returns the list of indices as a result. Each list will have 100s of elements. For my data points (2 Million records), this is growing bigger and bigger and stops due to memory issue after certain point. Any idea on how to solve this?

1

There are 1 best solutions below

1
On BEST ANSWER

Just in case if anyone looking for an answer for this, I have solved it by finding the nearest neighbours for a group (tree.query_ball_point can handle batches) and write in to database and then process next group, rather than keeping all in memory. Thanks.