I would like to assign a label to each element of an array, depending on whether it is below a certain threshold or not, and I tried to solve this with boolean indexing:
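import numpy as np

def easy_labeling(arr, thresh=5):
    # Boolean masks for the two classes
    negative_mask = arr < thresh
    positive_mask = arr >= thresh
    # Uninitialized string array with the same shape as arr
    labels = np.empty_like(arr, dtype=str)
    labels[negative_mask] = 'N'
    labels[positive_mask] = 'P'
    return labels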
So far so good. I created some dummy arrays to check whether it works:
test_arr1 = np.arange(24).reshape((12,2))
test_arr1
>>> array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15],
[16, 17],
[18, 19],
[20, 21],
[22, 23]])
easy_labeling(test_arr1)
>>> array([['N', 'N'],
['N', 'N'],
['N', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P']], dtype='<U1')
test_arr2 = np.random.randint(12, size=(12,2))
test_arr2
>>> array([[ 1, 11],
[ 5, 6],
[11, 7],
[ 9, 4],
[11, 3],
[ 0, 9],
[ 0, 4],
[11, 8],
[ 3, 6],
[ 0, 1],
[ 5, 8],
[10, 4]])
easy_labeling(test_arr2)
>>> array([['N', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'N'],
['P', 'N'],
['N', 'P'],
['N', 'N'],
['P', 'P'],
['N', 'P'],
['N', 'N'],
['P', 'P'],
['P', 'N']], dtype='<U1')
... and it seems that it does.
However, in my actual application I ran into other arrays with the same shape, type and dtype that produce a different outcome:
test_arr3 = np.array([[ 2,  0,  4,  4], [ 0,  2,  9, 11], [ 4,  4,  6, 10],
                      [11,  5, 10, 15], [ 5,  8,  0,  8], [ 3,  6,  5, 11],
                      [ 6,  7,  2,  9], [ 1,  1,  1,  2], [ 9, 11,  3, 14],
                      [ 8, 10,  7, 17], [10,  3, 11, 14], [ 7,  9,  8, 17]])
test_arr3 = test_arr3[:, 1:3]
test_arr3
>>> array([[ 0, 4],
[ 2, 9],
[ 4, 6],
[ 5, 10],
[ 8, 0],
[ 6, 5],
[ 7, 2],
[ 1, 1],
[11, 3],
[10, 7],
[ 3, 11],
[ 9, 8]])
easy_labeling(test_arr3)
>>> array([['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P'],
['P', 'P']], dtype='<U1')
Suddenly every element is labeled positive, even though the array clearly contains numbers below 5. As far as I can see, indexing still works: if I ask for arr[mask], I get the correct elements. Assigning through the mask, however, produces this incorrect result.
It gets even weirder: while writing this question I wanted to simplify the example and skip the "test_arr3 = test_arr3[:, 1:3]" step, so I entered the resulting array directly:
test_arr4 = np.array([[0, 4], [2, 9], [4, 6], [5, 10], [8, 0], [6, 5], [7, 2], [1, 1],
[11, 3], [10, 7], [3, 11], [9, 8]])
test_arr4
>>> array([[ 0, 4],
[ 2, 9],
[ 4, 6],
[ 5, 10],
[ 8, 0],
[ 6, 5],
[ 7, 2],
[ 1, 1],
[11, 3],
[10, 7],
[ 3, 11],
[ 9, 8]])
easy_labeling(test_arr4)
>>> array([['N', 'N'],
['N', 'P'],
['N', 'P'],
['P', 'P'],
['P', 'N'],
['P', 'P'],
['P', 'N'],
['N', 'N'],
['P', 'N'],
['P', 'P'],
['N', 'P'],
['P', 'P']], dtype='<U1')
... and suddenly it works, even though the two arrays appear to be identical!
I made sure that all test arrays have identical type, shape and dtype:
for x in [test_arr1, test_arr2, test_arr3, test_arr4]:
    print(type(x), x.shape, x.dtype)
>>> <class 'numpy.ndarray'> (12, 2) int32
<class 'numpy.ndarray'> (12, 2) int32
<class 'numpy.ndarray'> (12, 2) int32
<class 'numpy.ndarray'> (12, 2) int32
I assume that the arrays differ in some hidden attribute that I am not aware of. The whole thing makes very little sense to me. Does anybody have an idea?
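One property that the type/shape/dtype check does not compare is memory layout. The following is only a sketch of how to inspect it; the names `full`, `view` and `copy` are made up for illustration, and the printed strides depend on the platform's integer size:

```python
import numpy as np

# Hypothetical stand-ins: a column slice (like test_arr3) and a fresh
# array holding the same values (like test_arr4).
full = np.arange(48).reshape(12, 4)
view = full[:, 1:3]   # slicing columns returns a view into `full`
copy = view.copy()    # same values, but stored in its own compact buffer

for name, arr in [("view", view), ("copy", copy)]:
    # flags and strides describe how the elements are laid out in memory;
    # `base` is not None when the array borrows memory from another array.
    print(name, arr.flags['C_CONTIGUOUS'], arr.strides, arr.base is not None)
```

Both arrays compare equal element by element, yet only the copy is C-contiguous, so a difference of this kind would be invisible in the check above.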
A workaround seems to be to use np.chararray(arr.shape, unicode=True) instead of np.empty_like(arr, dtype=str). However, I would still like to know what is wrong with the other solution.
This looks like a bug in how empty_like handles dtype=str when the input array is not contiguous. (Update: I created a numpy bug report for this issue. The fix has been merged in the main development branch and will be in the next release (NumPy 1.22.0).)
Here's a simple example of the surprising behavior:
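A minimal sketch of such a reproduction, assuming a hypothetical non-contiguous input `a` built by slicing columns out of a larger array, with `x` being the result of `empty_like`:

```python
import numpy as np

# Hypothetical non-contiguous input: a column slice of a larger array.
a = np.arange(12).reshape(3, 4)[:, 1:3]
print(a.flags['C_CONTIGUOUS'])   # False: the slice skips columns in memory

# Same call as in the question's easy_labeling function.
x = np.empty_like(a, dtype=str)
print(x.shape)
# On affected releases (before NumPy 1.22.0) this is reported to be (0, 0);
# on releases that include the fix the strides should be nonzero.
print(x.strides)
```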
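The strides attribute of x should not be (0, 0). With zero strides, every index of x refers to the same memory location, so each assignment overwrites all elements at once. That would explain why only the last label assigned ('P') shows up in the result.
Another work-around (in addition to the one you suggested) is to use an explicit NumPy data type instead of str in the call of empty_like: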
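As a sketch of that work-around, the labeling function from the question could be written with the explicit dtype `'U1'` (a one-character unicode string). The function name `easy_labeling_explicit` and the test array are made up for illustration:

```python
import numpy as np

def easy_labeling_explicit(arr, thresh=5):
    # 'U1' has a fixed, nonzero item size, unlike the generic str dtype.
    labels = np.empty_like(arr, dtype='U1')
    labels[arr < thresh] = 'N'
    labels[arr >= thresh] = 'P'
    return labels

# Non-contiguous input, analogous to test_arr3 in the question.
sliced = np.arange(12).reshape(3, 4)[:, 1:3]   # [[1, 2], [5, 6], [9, 10]]
result = easy_labeling_explicit(sliced)
print(result)
# [['N' 'N']
#  ['P' 'P']
#  ['P' 'P']]
```

Because every element is either below the threshold or not, the two masks together cover the whole array, so no entry of the uninitialized `labels` array is left unset.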