From an item index in a masked array, calculate the index of the same item in the original sorted array

I masked a sorted 1-D numpy array using the method below (which follows a solution proposed here):

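import numpy as np

def get_from_sorted(sorted, idx):
    # Boolean mask that is True only at the requested positions
    mask = np.zeros(sorted.shape, bool)
    mask[idx] = True
    # Boolean indexing keeps the order of the original array
    return sorted[mask]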

The Python function returns the array masked on the indices idx. For example, if sorted=np.array([0.1,0.2,0.3,0.4,0.5]) and idx=np.array([4,0,1]), then get_from_sorted should return np.array([0.1,0.2,0.5]) (note that the order of the original array is preserved).

Question: I need to get the mapping between the indices of the items in the array after masking and those in the original list. In the example above, such a mapping is

0 -> 0
1 -> 1
2 -> 4

because 0.1, 0.2, and 0.5 are at positions 0, 1, and 4 in sorted.

How can I program this mapping efficiently?

Requirement on efficiency: Efficiency is key in my problem. Here, "sorted" is a 1-D array of 1 million elements, and "idx" is a 1-D array of about 0.5 million elements (taken from an image processing application). Thus, checking the elements of the masked array against the original array, whether one by one or in a vectorized fashion (for example, using np.where), would not perform well in my case. Ideally, there should be a relatively simple mathematical relation between the indices in the masked array and those in the original sorted array. Any ideas?

There are 3 best solutions below

I assume (from your example) that the original list is the sorted list. In which case, unless I misunderstand, you just do:

idx.sort()

and then the mapping is i -> idx[i], that is, item i of the masked array sits at position idx[i] of the original array.

Of course, if the original order of idx is important, make a copy first.
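
As a minimal sketch of this idea on the example from the question (the name sorted_arr is only illustrative, and np.sort is used instead of idx.sort() so that idx itself is left untouched; it assumes idx contains no duplicate indices):

```python
import numpy as np

def get_from_sorted(sorted_arr, idx):
    # Same masking approach as in the question
    mask = np.zeros(sorted_arr.shape, bool)
    mask[idx] = True
    return sorted_arr[mask]

sorted_arr = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
idx = np.array([4, 0, 1])

masked = get_from_sorted(sorted_arr, idx)

# Sorted copy of idx: mapping[i] is the position in sorted_arr of masked[i]
mapping = np.sort(idx)

for i, j in enumerate(mapping):
    print(i, "->", j)
```

If idx could contain repeated indices, the mask would collapse them to a single True, so np.unique(idx) would be the safer choice in that case.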

The question is not clear to me. It can have several interpretations.

mask -> idx (in ascending order):

Let me try with a fairly large dataset (10M values, 10% of them True):

x = np.random.choice(a=[False, True], size=(10000000,), p=[0.9, 0.1])

In this case, np.where is quite efficient:

%timeit np.where(x)[0]
%timeit x.nonzero()[0]
%timeit np.arange(len(x))[x]
24.8 ms ± 551 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
24.5 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.4 ms ± 895 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
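
To make the link to the question explicit, here is a small sketch (with made-up data and illustrative names) showing that the positions recovered from the mask are exactly the mapping from masked indices to original indices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a sorted array and a boolean mask keeping roughly 10% of it
values = np.sort(rng.random(1000))
mask = rng.random(values.size) < 0.1

masked = values[mask]

# positions[i] is the index in `values` of masked[i]
positions = np.flatnonzero(mask)
```

np.flatnonzero(mask) returns the same result as np.where(mask)[0] for a 1-D mask.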

random items of sorted -> idx (in ascending order):

If you have lost any reference to the positions of the items you need to take from sorted, you can still find idx as long as there are no duplicate items. This is O(n log n):

x = np.random.choice(a=[False, True], size=(10000000,), p=[0.9, 0.1])
arr = np.linspace(0,1,len(x))
sub_arr = arr[x]  # input data: skipping 90% of items

%timeit np.searchsorted(arr, sub_arr)  # output data
112 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
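
On the small example from the question, the same call can be sketched as follows. It assumes that arr is sorted in ascending order, that every value in sub_arr really occurs in arr, and that arr has no duplicates (with duplicates, the default side='left' would return the leftmost match):

```python
import numpy as np

arr = np.array([0.1, 0.2, 0.3, 0.4, 0.5])   # original sorted array
sub_arr = np.array([0.1, 0.2, 0.5])         # items left after masking

# Binary search for each item of sub_arr in arr
positions = np.searchsorted(arr, sub_arr)

# Sanity check: the recovered positions point back at the same values
assert np.array_equal(arr[positions], sub_arr)
```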

idx (in any order) -> idx (in ascending order):

This one is simple:

x = np.arange(10000000)
np.random.shuffle(x)
idx = x[:1000000] #input data: first 1M of random idx
%timeit np.sort(idx) #output data
65.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If you need to know where the masked entries came from, you can use one of np.where, np.nonzero or np.flatnonzero. However, if you need to get the origins of only a subset of the indices, you can use a function I recently wrote as part of my library, haggis: haggis.npy_util.unmasked_index [1].

Given mask and the indices of some of your mask elements, you can retrieve a multi-dimensional index of the original locations with

unmasked_index(idx, mask)

If you ever need it, there is also an inverse function haggis.npy_util.masked_index that converts a location in a multidimensional input array into its index in the masked array.
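
For the 1-D case only, the inverse direction can also be sketched in plain NumPy. This is an illustration of the idea, not the haggis implementation, and the name inverse is just a placeholder:

```python
import numpy as np

# Mask from the question's example: positions 0, 1 and 4 are kept
mask = np.array([True, True, False, False, True])

# A running count of True values gives each kept element its index in the
# masked array; positions that were masked out are marked with -1
inverse = np.where(mask, np.cumsum(mask) - 1, -1)

print(inverse)
```

Here inverse[j] is the index in the masked array of original position j, so it undoes the mapping given by np.flatnonzero(mask).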

[1] Disclaimer: I am the author of haggis.