NumPy boolean index assignment sometimes fails and assigns to the entire array


I would like to assign a label to each element of an array, depending on whether it is below or above a certain threshold, and I tried to solve this with boolean indexing:

import numpy as np

def easy_labeling(arr, thresh=5):
  negative_mask = arr < thresh
  positive_mask = arr >= thresh
  labels = np.empty_like(arr, dtype=str)
  labels[negative_mask] = 'N'
  labels[positive_mask] = 'P'
  return labels

So far so good. I created some dummy arrays to check whether it works:

test_arr1 = np.arange(24).reshape((12,2))
test_arr1
>>> array([[ 0,  1],
           [ 2,  3],
           [ 4,  5],
           [ 6,  7],
           [ 8,  9],
           [10, 11],
           [12, 13],
           [14, 15],
           [16, 17],
           [18, 19],
           [20, 21],
           [22, 23]])
easy_labeling(test_arr1)
>>> array([['N', 'N'],
           ['N', 'N'],
           ['N', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P']], dtype='<U1')
test_arr2 = np.random.randint(12, size=(12,2))
test_arr2
>>> array([[ 1, 11],
           [ 5,  6],
           [11,  7],
           [ 9,  4],
           [11,  3],
           [ 0,  9],
           [ 0,  4],
           [11,  8],
           [ 3,  6],
           [ 0,  1],
           [ 5,  8],
           [10,  4]])
easy_labeling(test_arr2)
>>> array([['N', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'N'],
           ['P', 'N'],
           ['N', 'P'],
           ['N', 'N'],
           ['P', 'P'],
           ['N', 'P'],
           ['N', 'N'],
           ['P', 'P'],
           ['P', 'N']], dtype='<U1')

... and it seems that it does.

However, in my actual application I ran into other arrays with the same shape, type and dtype, but a different outcome:

test_arr3 = np.array([[ 2,  0,  4,  4], [ 0,  2,  9, 11], [ 4,  4,  6, 10],
                      [11,  5, 10, 15], [ 5,  8,  0,  8], [ 3,  6,  5, 11],
                      [ 6,  7,  2,  9], [ 1,  1,  1,  2], [ 9, 11,  3, 14],
                      [ 8, 10,  7, 17], [10,  3, 11, 14], [ 7,  9,  8, 17]])
test_arr3 = test_arr3[:, 1:3]
test_arr3
>>> array([[ 0,  4],
           [ 2,  9],
           [ 4,  6],
           [ 5, 10],
           [ 8,  0],
           [ 6,  5],
           [ 7,  2],
           [ 1,  1],
           [11,  3],
           [10,  7],
           [ 3, 11],
           [ 9,  8]])
easy_labeling(test_arr3)
>>> array([['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P']], dtype='<U1')

All of a sudden, every element is labeled positive, even though the array clearly contains numbers below 5. As far as I can see, indexing still works: if I ask for arr[mask], I get the correct elements. Assigning to it, however, produces this incorrect result.

It gets even weirder: While writing down this question I wanted to simplify the above expression and not have to do the "test_arr3 = test_arr3[:, 1:3]" part, so I entered the array I wanted to have directly:

test_arr4 = np.array([[ 0,  4], [ 2,  9], [ 4,  6], [ 5, 10], [ 8,  0], [ 6,  5],
                      [ 7,  2], [ 1,  1], [11,  3], [10,  7], [ 3, 11], [ 9,  8]])
test_arr4
>>> array([[ 0,  4],
           [ 2,  9],
           [ 4,  6],
           [ 5, 10],
           [ 8,  0],
           [ 6,  5],
           [ 7,  2],
           [ 1,  1],
           [11,  3],
           [10,  7],
           [ 3, 11],
           [ 9,  8]])
easy_labeling(test_arr4)
>>> array([['N', 'N'],
           ['N', 'P'],
           ['N', 'P'],
           ['P', 'P'],
           ['P', 'N'],
           ['P', 'P'],
           ['P', 'N'],
           ['N', 'N'],
           ['P', 'N'],
           ['P', 'P'],
           ['N', 'P'],
           ['P', 'P']], dtype='<U1')

... and suddenly it works, even though the two arrays appear to be identical!

I made sure that all test arrays have identical type, shape and dtype:

for x in [test_arr1, test_arr2, test_arr3, test_arr4]:
...   print(type(x), x.shape, x.dtype)
>>> <class 'numpy.ndarray'> (12, 2) int32
    <class 'numpy.ndarray'> (12, 2) int32
    <class 'numpy.ndarray'> (12, 2) int32
    <class 'numpy.ndarray'> (12, 2) int32

I assume that the arrays have some kind of hidden attribute that I am not aware of. The whole thing makes very little sense to me. Does anybody have an idea?


A workaround seems to be to use np.chararray(arr.shape, unicode=True) instead of np.empty_like(arr, dtype=str). However, I would still like to know what is wrong with the other solution.

This looks like a bug in how empty_like handles dtype=str when the input array is not contiguous. (Update: I created a numpy bug report for this issue. The fix has been merged in the main development branch and will be in the next release (NumPy 1.22.0).)
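Contiguity is most likely the "hidden attribute" from the question: a column slice such as test_arr3[:, 1:3] is a view into the larger array, while an array typed in directly owns a compact buffer. The following is only a sketch for checking this; the names full, sliced and direct are made up for illustration, and np.arange data stands in for the question's values.

```python
import numpy as np

# Stand-in for the (12, 4) array from the question (values are illustrative).
full = np.arange(48).reshape(12, 4)

sliced = full[:, 1:3]    # view into `full`, like test_arr3
direct = sliced.copy()   # fresh C-ordered buffer, like test_arr4

# type, shape and dtype agree, but the memory layout does not.
for name, arr in [("sliced", sliced), ("direct", direct)]:
    print(name, type(arr).__name__, arr.shape, arr.dtype,
          "C-contiguous:", arr.flags['C_CONTIGUOUS'],
          "strides:", arr.strides)
```

If the input may be a view, passing np.ascontiguousarray(arr) to the labeling function should also sidestep the problem, at the cost of a copy.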

Here's a simple example of the surprising behavior:

In [66]: a = np.arange(9).reshape(3, 3)

In [67]: b = a[:, ::2]

In [68]: b
Out[68]: 
array([[0, 2],
       [3, 5],
       [6, 8]])

In [69]: x = np.empty_like(b, dtype=str)

In [70]: x
Out[70]: 
array([['', ''],
       ['', ''],
       ['', '']], dtype='<U1')

In [71]: x.strides
Out[71]: (0, 0)

The strides attribute of x should not be (0, 0).
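With strides of (0, 0), every index of x refers to the same memory location, which would explain why the last masked assignment appears to overwrite the whole array. The sketch below imitates that situation on any NumPy version by building a zero-stride array on purpose with as_strided. It illustrates the aliasing effect only; it is not a reproduction of what empty_like did internally, and the name cell is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# One single '<U1' cell, exposed as a (3, 2) array whose strides are all zero.
cell = np.empty(1, dtype='U1')
x = as_strided(cell, shape=(3, 2), strides=(0, 0))

b = np.arange(9).reshape(3, 3)[:, ::2]   # same input as in the example above

x[b < 5] = 'N'    # writes 'N' into the one shared cell
x[b >= 5] = 'P'   # overwrites that same cell with 'P'

print(x)          # every element now reads 'P'
```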

Another work-around (in addition to the one you suggested) is to use an explicit NumPy data type instead of str in the call of empty_like:

In [72]: x = np.empty_like(b, dtype='U1')

In [73]: x
Out[73]: 
array([['', ''],
       ['', ''],
       ['', '']], dtype='<U1')

In [74]: x.strides
Out[74]: (8, 4)
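Applied to the function from the question, the explicit-dtype workaround could look like the sketch below. The second variant uses np.where, which is a different technique from the original masked assignment: it builds the result directly and does not call empty_like at all. Both function names are made up for this example.

```python
import numpy as np

def easy_labeling_fixed(arr, thresh=5):
    # Same logic as the original, but with an explicit NumPy dtype
    # ('U1' = one unicode character) instead of dtype=str.
    labels = np.empty_like(arr, dtype='U1')
    labels[arr < thresh] = 'N'
    labels[arr >= thresh] = 'P'
    return labels

def easy_labeling_where(arr, thresh=5):
    # Alternative: let np.where allocate and fill the result in one step.
    return np.where(arr < thresh, 'N', 'P')

# A non-contiguous input, built the same way as test_arr3 in the question.
view = np.arange(24).reshape(6, 4)[:, 1:3]
print(easy_labeling_fixed(view))
print(easy_labeling_where(view))
```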