Preface
I am currently using sklearn's Nearest Neighbors (NN) algorithm. A fitted model's radius_neighbors method returns two ragged arrays: for each observation, the distances to every observation in the dataset that lies within a specified radius, along with the indices of those observations.
Example:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
# generate sample dataframe
df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 3)), columns=list('ABC'))
# NN algorithm
n = NearestNeighbors(radius=5, metric='chebyshev')
subset = df[["A", "B"]]
n.fit(subset)
# NN output
neighbors = n.radius_neighbors(subset)
# print(neighbors)
#(
# array([array([0., 1., 5., 2., 2., 1., 3.]), array([0., 1., 3.]), ...], dtype=object),
# array([array([0, 23, 452, 31, 75, 903, 2], dtype=int64), array([1, 523, 12], dtype=int64), ...], dtype=object)
#)
Goal
My goal is to find, for each index, the neighbors that lie within each radius in a list radii. I am doing this because NN is relatively slow on large datasets, and re-computing neighbors for each radius (i.e. calling NearestNeighbors() with a different radius, then fitting the model) is inefficient.
Using the output above as an example:
radii = [1, 2]
For the first element of my dataframe:
- 0, 23, 903 are within 1 unit.
- 0, 23, 31, 75, 903 are within 2 units.
To make this process as efficient as possible, my idea was to apply a condition to the distances, then use the resulting list of boolean arrays to subset the indices.
Idea
To apply a condition to the ragged distance arrays, I figured the best method is a list comprehension.
bool_arrays = [arr <= 1 for arr in neighbors[0]]
# However, I cannot just apply these bool_arrays to the indices.
neighbors[1][bool_arrays]
# IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
# and if I try to convert bool_arrays to a numpy array and index that way
neighbors[1][np.array(bool_arrays)]
# VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences
# IndexError: arrays used as indices must be of integer (or boolean) type
What is an efficient way to index a ragged numpy array with a ragged array of booleans using numpy? Do I have to use a list comprehension, or is there a better method I'm missing?
I'm not sure if numpy supports this kind of boolean comparison with ragged vectors.
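One possible workaround that stays mostly inside numpy is to flatten both ragged arrays, mask the flat arrays once, and split the result back into rows. The sketch below is only an illustration under assumptions: filter_ragged is a hypothetical helper name, the small arrays are hand-built stand-ins for the radius_neighbors output shown in the question, and at least one row is assumed to exist.

```python
import numpy as np

def filter_ragged(distances, indices, radius):
    """Hypothetical helper: per row, keep the indices whose distance is <= radius."""
    lengths = np.array([len(d) for d in distances])
    flat_dist = np.concatenate(list(distances))
    flat_idx = np.concatenate(list(indices))
    mask = flat_dist <= radius
    # Label every flattened element with the row it came from
    row_ids = np.repeat(np.arange(len(lengths)), lengths)
    # Count how many elements survive the mask in each row (0 is allowed)
    kept = np.bincount(row_ids[mask], minlength=len(lengths))
    # Cut the surviving indices back into one array per row
    return np.split(flat_idx[mask], np.cumsum(kept)[:-1])

# Stand-in for the (distances, indices) tuple from the question's sample output
distances = np.empty(2, dtype=object)
distances[0] = np.array([0., 1., 5., 2., 2., 1., 3.])
distances[1] = np.array([0., 1., 3.])
indices = np.empty(2, dtype=object)
indices[0] = np.array([0, 23, 452, 31, 75, 903, 2])
indices[1] = np.array([1, 523, 12])

result_r1 = filter_ragged(distances, indices, 1)
result_r2 = filter_ragged(distances, indices, 2)
```

With the real output, the call would be filter_ragged(neighbors[0], neighbors[1], r) for each r in radii. Whether this beats a plain list comprehension depends on the data, so it is worth timing both.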
If you want to use list comprehension, the method below will do the job:
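A minimal sketch of such a list comprehension, assuming the distances and indices are laid out like the sample output in the question. The small arrays are hand-built stand-ins, and within is just an illustrative name:

```python
import numpy as np

# Stand-in for the (distances, indices) tuple returned by radius_neighbors
distances = np.empty(2, dtype=object)
distances[0] = np.array([0., 1., 5., 2., 2., 1., 3.])
distances[1] = np.array([0., 1., 3.])
indices = np.empty(2, dtype=object)
indices[0] = np.array([0, 23, 452, 31, 75, 903, 2])
indices[1] = np.array([1, 523, 12])

radii = [1, 2]

# For each radius, pair every distance row with its index row
# and apply the boolean mask one row at a time
within = {
    r: [idx[dist <= r] for dist, idx in zip(distances, indices)]
    for r in radii
}
```

Here within[1][0] holds the neighbors of the first observation within 1 unit, and within[2][0] those within 2 units. The mask is applied to each inner array separately, which avoids building a ragged boolean ndarray and the IndexError that comes with it.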