Background:
I took over some code in a notebook that makes heavy use of global variables, which on the one hand made it easy to use Pool.imap, but on the other hand makes it difficult to read, debug, and move out of a Jupyter notebook into the real world.
This method simply calls another method, query_rec, to perform a KNN search around the given point. Note that points_adjusted and times are defined outside of the function, and query_rec uses a KDTree defined outside of its scope:
def get_neighbors(i):
    point = points_adjusted[i] + (times[i],)
    temp = query_rec(point, INPUT_EVENT_COUNT, 2)
    return temp
def query_rec(point, k, rk):
    # KNN SEARCH... TOO MUCH CODE AND DOESN'T MATTER FOR THE QUESTION
    ...
sorted_training_data = [t for t in pool.imap(get_neighbors, np.arange(num_points)) if t]
What I want to achieve:
I want to refactor get_neighbors and query_rec to not use global variables but still be able to use multiprocessing.
What I tried Part 1:
I refactored the above functions so that they take the former global variables as arguments:
def get_neighbors(points, tree, i, k=INPUT_EVENT_COUNT):
    point = points[i]
    temp = query_rec(point, tree, k, 2)
    return temp
Following this up, I have to make an iterable containing all the arguments I want to pass to my newly refactored function:
pool = Pool(NUM_WORKERS)
args = zip([points] * num_points, [training_tree] * num_points, np.arange(num_points))
sorted_training_data = [t for t in pool.starmap(get_neighbors, args) if t]
The Problem with my solution:
There are about 3 million points in points, and I am making 3 million copies of the KDTree training_tree. This seems really bad to me.
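For intuition, here is a rough, purely illustrative check (assuming training_tree is picklable): starmap pickles every argument tuple before sending it to a worker, so each task pays for serializing the large objects again.

import pickle

# Rough illustration: each task's argument tuple carries the full points list
# and the tree, and the whole tuple is pickled once per task.
one_task_args = (points, training_tree, 0)
print(len(pickle.dumps(one_task_args)))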
What I tried Part 2:
I tried encapsulating the functionality I wanted in a new data structure like so:
class TimeTree:
    """
    A data structure combining a KDTree and a uniform gridspace of points for efficient NN searches.
    """

    def __init__(self, kdtree, grid_points):
        """
        :param kdtree: a KDTree containing event data points in the form (lat, lng, time)
        :param grid_points: a uniform gridspace of (lat, lng, time) points
        """
        self.tree = kdtree
        self.points = grid_points
        self.size = len(grid_points)

    def search(self, idx, k, rk=2):
        """
        A function designed to be used with a multiprocessing Pool to perform a global KNN search
        of all points in the ``self.points`` list.

        :param idx: The index of the point to search around.
        :param k: The number of neighbors to search for.
        :param rk: A recursive constant for extended search capabilities.
        """
        return query_rec(self.points[idx], self.tree, k, rk)
And then created a helper function to generate the data:
def generate_data(k, t, workers=NUM_WORKERS):
    args = zip(np.arange(t.size), [k] * t.size)
    with Pool(workers) as p:
        data = [d for d in tqdm(p.starmap(t.search, args), total=t.size) if d]
    return data
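The intended usage would look roughly like this (training_tree and grid_points stand for the already-built KDTree and the grid of (lat, lng, time) points):

# Hypothetical usage of the pieces above: wrap the existing KDTree and grid,
# then run the parallel KNN search over every grid point.
time_tree = TimeTree(training_tree, grid_points)
sorted_training_data = generate_data(INPUT_EVENT_COUNT, time_tree)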
I read that this is a solution to the pickling problems that come up when using Pool.map. I believe this might work, except I found yet another global variable inside the query_rec definition that I had not noticed before. This may be a solution, and I will update later.
The Question:
How do I efficiently use multi-processing on a function that takes large data structures as arguments?
Actually, I would suggest you use partial from functools; it has helped me a lot.
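A minimal sketch of the idea (f, a, b, c, d, x, y are placeholder names; substitute your own function and its fixed arguments):

from functools import partial

# Hypothetical function: a, b, c, d are the fixed (possibly large) arguments,
# while x and y change from task to task.
def f(a, b, c, d, x, y):
    return a * x + b * y + c + d

a, b, c, d = 1, 2, 3, 4

# Bind the constant arguments once; f_of_x only needs (x, y) per call.
f_of_x = partial(f, a, b, c, d)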
This f_of_x you can then pass to Pool.starmap as starmap(f_of_x, zip(X, Y)).
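For example (X and Y here are just placeholder lists of the per-task arguments):

from multiprocessing import Pool

X = [1, 2, 3]
Y = [4, 5, 6]

if __name__ == "__main__":
    with Pool() as p:
        # Each task receives one (x, y) pair; the bound a, b, c, d are
        # carried along inside the pickled partial object.
        results = p.starmap(f_of_x, zip(X, Y))
    print(results)  # [16, 19, 22] with the toy f above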
You can also import the values for a, b, c, d from another file that holds constants. But be careful: a Pool with imported global variables will not update or change them in the original file when it runs (it works in a weird way), and it will not pickle lambdas.
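For example (constants.py is a hypothetical file name):

# In a hypothetical constants.py:
#     A, B, C, D = 1, 2, 3, 4
from functools import partial
from constants import A, B, C, D

f_of_x = partial(f, A, B, C, D)  # f as defined above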
There are so many problems you can run into with Pool and other multiprocessing features in Python, and sometimes it is hard to find an answer. Some of them can be found with different or related Google searches. Good luck :)