How to index a numpy list of lists with a list of lists of boolean values?

Preface

I am currently using sklearn's Nearest Neighbors (NN) algorithm. A fitted model's radius_neighbors method returns two ragged arrays (arrays of arrays): for each observation, the distances to every observation in the dataset that lies within a specified radius, along with the indices of those observations.

Example:

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# generate sample dataframe
df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 3)), columns=list('ABC'))

# NN algorithm
n = NearestNeighbors(radius=5, metric='chebyshev')
subset = df[["A", "B"]]
n.fit(subset)

# NN output: a tuple (distances, indices) of ragged object arrays
neighbors = n.radius_neighbors(subset)

# print(neighbors)
#(
#    array([array([0., 1., 5., 2., 2., 1., 3.]), array([0., 1., 3.]), ...], dtype=object),
#    array([array([0, 23, 452, 31, 75, 903, 2], dtype=int64), array([1, 523, 12], dtype=int64), ...], dtype=object)
#)

Goal

My goal is to find, for each index, the neighbors that are within each radius in a list radii. I am doing this because NN is a relatively slow algorithm on large datasets, and re-computing the neighbors for every radius (i.e. calling NearestNeighbors() with a different radius, then fitting the model again) is inefficient.

Using the output above as an example:

radii = [1, 2]

For the first element of my dataframe:
 - 0, 23, 903 are within 1 unit.
 - 0, 23, 31, 75, 903 are within 2 units away.

To make this process as efficient as possible, my idea was to apply a condition to the distances, then use the resulting list of boolean arrays to subset the indices.
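For a single observation the idea can be checked by hand. This is only a small sketch: dist_0 and idx_0 are illustrative names holding the first row of the sample output above, not variables from my actual code.

```python
import numpy as np

# Distances and indices for the first observation, copied from the sample output above.
dist_0 = np.array([0., 1., 5., 2., 2., 1., 3.])
idx_0 = np.array([0, 23, 452, 31, 75, 903, 2])

radii = [1, 2]

# One boolean mask per radius, applied to the matching index array.
within = {rad: idx_0[dist_0 <= rad] for rad in radii}

for rad, idx in within.items():
    print(rad, idx.tolist())
```

Both arrays have the same length here, so plain boolean indexing works; the difficulty only appears once every row has a different length.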

Ideal

In order to apply a condition to the ragged distance vectors in neighbors[0], I figured the best method is to use a list comprehension.

bool_arrays = [arr <= 1 for arr in neighbors[0]]

# However, I cannot just apply these bool_arrays to the indices.
neighbors[1][bool_arrays]

# IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

# and if I try to convert bool_arrays to a numpy array and index that way
neighbors[1][np.array(bool_arrays)]

# VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences
# IndexError: arrays used as indices must be of integer (or boolean) type
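The second failure can be reproduced on a toy input. The sketch below assumes two-row stand-ins for neighbors[0] and neighbors[1] (dists, idxs and masks are illustrative names): ragged masks can only be held in an object-dtype array, and an object-dtype array is not accepted as an index.

```python
import numpy as np

# Hypothetical two-row stand-ins for neighbors[0] and neighbors[1].
dists = np.empty(2, dtype=object)
dists[0] = np.array([0., 1., 5.])
dists[1] = np.array([0., 3.])
idxs = np.empty(2, dtype=object)
idxs[0] = np.array([0, 23, 452])
idxs[1] = np.array([1, 12])

# Masks of different lengths have to be stored in an object-dtype array.
masks = np.empty(2, dtype=object)
masks[0] = dists[0] <= 1
masks[1] = dists[1] <= 1

try:
    idxs[masks]
    failed = False
except IndexError:
    # An object-dtype index is neither integer nor boolean.
    failed = True

# Row by row it works, because each individual mask is a plain boolean array.
row_0 = idxs[0][masks[0]]
```

So the masks themselves are fine; it is only the outer, ragged level that cannot be indexed in one step.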

What is an efficient method to index a ragged numpy array with a ragged array of booleans using numpy? Would I have to use a list comprehension, or is there an even better method I'm missing?

There is 1 solution below

I'm not sure if numpy supports this kind of boolean comparison with ragged vectors.

If you want to use a list comprehension, the method below will do the job:

radii = [1, 2]
# my_list[k][i] holds the neighbor indices of observation i that are within radii[k]
my_list = [[neighbors[1][i][neighbors[0][i] <= rad] for i in range(len(neighbors[0]))] for rad in radii]
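If the Python-level loop over rows turns out to be the bottleneck, one possible alternative is to flatten the ragged arrays, apply the mask once, and split the result back into rows. This is a sketch I have not benchmarked: filter_by_radius is a hypothetical helper (not a numpy or sklearn function), and distances / indices are toy stand-ins for neighbors[0] and neighbors[1] built from the sample output in the question.

```python
import numpy as np

# Toy stand-ins for neighbors[0] and neighbors[1], using the values from the question.
distances = np.empty(2, dtype=object)
distances[0] = np.array([0., 1., 5., 2., 2., 1., 3.])
distances[1] = np.array([0., 1., 3.])
indices = np.empty(2, dtype=object)
indices[0] = np.array([0, 23, 452, 31, 75, 903, 2])
indices[1] = np.array([1, 523, 12])


def filter_by_radius(distances, indices, rad):
    """Hypothetical helper: per row, keep the indices whose distance is <= rad."""
    flat_dist = np.concatenate(list(distances))
    flat_idx = np.concatenate(list(indices))
    mask = flat_dist <= rad

    # Remember which row every flattened element came from.
    lengths = np.array([len(d) for d in distances])
    row_ids = np.repeat(np.arange(len(lengths)), lengths)

    # Count the survivors per row, then cut the filtered flat array at those offsets.
    kept_per_row = np.bincount(row_ids[mask], minlength=len(lengths))
    return np.split(flat_idx[mask], np.cumsum(kept_per_row)[:-1])


radii = [1, 2]
result = {rad: filter_by_radius(distances, indices, rad) for rad in radii}
```

Here result[rad][i] plays the same role as my_list[k][i] above. The flattening only needs to happen once if you move it out of the helper, so the per-radius cost is a single comparison plus a split; whether that beats the list comprehension will depend on how many rows and radii you have.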