I have a question about a numpy array. My dataset has 50 rows and 15 columns, and I created a numpy array from it like this:
```python
x = x.to_numpy()
```
My aim is to compare each row with every other row (elementwise, excluding itself) and find, for each row, whether there is any row whose values are all smaller than that row's.
Sample table:

```
a  b  c
1  6  2
2  6  8
4  7 12
7  9 13
```
For example, for rows 1 and 2 there is no such row. But for rows 3 and 4, all values of rows 1 and 2 are smaller. So the algorithm should return the count 2 (indicating rows 3 and 4).
What Python code would produce this particular result? I have tried a bunch of code but could not reach a proper solution, so if anyone has an idea I would appreciate it.
Edit: Pure-numpy solution
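Assembled from the steps detailed in the explanation below, the whole computation is one expression (a sketch using the sample data; the hardcoded `3` is the number of columns):

```python
import numpy as np

x = np.array([[1, 6, 2],
              [2, 6, 8],
              [4, 7, 12],
              [7, 9, 13]])

# x[i] > x[j], elementwise, for every pair (i, j): a 4x4x3 boolean array
pairwise = x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)

# all(axis=2): is row i greater than row j in every column?
# any(axis=1): is row i greater than at least one other row?
# sum():       count such rows
count = pairwise.all(axis=2).any(axis=1).sum()
print(count)  # 2
```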
Explanation
The hard part is to think in 3d, so I start in 2d, with a simple comparison of numbers. Imagine you have `x = np.array([1, 2, 3, 4])` and you want to compare all elements of `x` to all other elements of `x`, making a 4x4 matrix of booleans. What you would do is reshape `x` as a column of values on one side, and as a line on the other. So two 2d arrays: one 4x1, the other 1x4.

Then, when performing an operation between those two arrays, broadcasting will create a 4x4 array.
Just to visualize it, first use an operation other than a comparison, so that each cell of the 4x4 result shows which pair of elements it combines.
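For instance, something like this (the scaling by 10 is my choice of a non-comparison operation that keeps both operands visible in each cell):

```python
import numpy as np

x = np.array([1, 2, 3, 4])

col = x.reshape(-1, 1)   # shape (4, 1)
row = x.reshape(1, -1)   # shape (1, 4)

# Broadcasting expands both operands to (4, 4); cell [i, j]
# combines x[i] (tens digit) with x[j] (units digit).
print(col * 10 + row)
# [[11 12 13 14]
#  [21 22 23 24]
#  [31 32 33 34]
#  [41 42 43 44]]

# The comparison version: a 4x4 matrix of booleans x[i] > x[j]
print(col > row)
# [[False False False False]
#  [ True False False False]
#  [ True  True False False]
#  [ True  True  True False]]
```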
So all we have to do is the exact same thing, but with values being length-3 1d arrays instead of scalars:
```python
x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)
```

Broadcasting will make this, as in the previous example, a 2d array of all `x[i] > x[j]`, except that `x[i]`, `x[j]`, and therefore `x[i] > x[j]` are not scalars but length-3 1d arrays. So our result is a 2d array of length-3 1d arrays, a.k.a. a 3d array.

Now we just have to apply our `all`, `any` and `sum` to this. For `x[i]` to be considered greater than `x[j]`, we need all the values of `x[i]` to be greater than the corresponding values of `x[j]`. Hence the `all` on axis 2 (the axis of length 3). Now we have a 2d matrix telling, for each i, j, whether `x[i] > x[j]`.

For `x[i]` to have a smaller counterpart, that is for `x[i]` to be greater than at least one `x[j]`, we need at least one `True` in line i of that matrix. Hence the `any(axis=1)`.

And lastly, what we have at this point is a 1d array of booleans, `True` wherever there exists at least one smaller row. We just need to count them. Hence the `.sum()`.

Compound iteration
One-liner (with one loop; not ideal, but better than two loops):
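That one-liner, run on the sample data (the expression is unpacked step by step below):

```python
import numpy as np

x = np.array([[1, 6, 2],
              [2, 6, 8],
              [4, 7, 12],
              [7, 9, 13]])

# One Python loop over rows; all the inner work is done by numpy
count = sum((r > x).all(axis=1).any() for r in x)
print(count)  # 2
```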
`r>x` is an array of booleans comparing each element of row `r` to each element of `x`. So, for example, when `r` is row `x[2]`, that is `[4, 7, 12]`, then `r>x` has two all-`True` lines (rows 0 and 1 of `x` are elementwise smaller) and two all-`False` lines (rows 2 and 3 are not).
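Concretely, with the sample data from the question:

```python
import numpy as np

x = np.array([[1, 6, 2],
              [2, 6, 8],
              [4, 7, 12],
              [7, 9, 13]])

r = x[2]  # [4, 7, 12]
print(r > x)
# [[ True  True  True]
#  [ True  True  True]
#  [False False False]
#  [False False False]]
```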
So `(r>x).all(axis=1)` is a shape-`(4,)` array of booleans telling whether all booleans in each line are `True` (`.all(axis=1)` iterates through columns only). In the previous example, that would be `[True, True, False, False]`. `(x[1]>x).all(axis=1)` would be `[False, False, False, False]` (the first line of `x[1]>x` contains two `True`s, but that is not enough for `.all`).
So `(r>x).all(axis=1).any()` tells what you want to know: whether there is any line whose columns are all `True`, that is, whether there is any `True` in the previous array.

`((r>x).all(axis=1).any() for r in x)` is a generator expression performing this computation for all rows `r` of `x`. If you replaced the outer `()` by `[]`, you would get a list of `True` and `False` values (`False, False, True, True`, to be accurate: as you've already said, `False` for the first two rows, `True` for the other two). But there is no need to build a list here, since we just want to count. A generator produces results only as the caller requires them, and here the caller is `sum`. So `sum((r>x).all(axis=1).any() for r in x)` counts the number of times we get `True` in the previous computation.

(In this case, since there are only 4 elements, I am not sparing much memory by using a generator rather than a list comprehension. But it is a good habit to favor generators when we don't really need to keep all the intermediate results in memory.)
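To see those intermediate booleans explicitly, materialize the list (the `bool(...)` cast is my addition, just to print plain Python booleans):

```python
import numpy as np

x = np.array([[1, 6, 2],
              [2, 6, 8],
              [4, 7, 12],
              [7, 9, 13]])

# The list-comprehension variant: builds all four flags in memory
flags = [bool((r > x).all(axis=1).any()) for r in x]
print(flags)       # [False, False, True, True]
print(sum(flags))  # 2
```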
Timings
For your example, the computation takes 19 μs for the pure numpy version, 48 μs for my former (compound) answer, and 115 μs for di.bezrukov's.
But the difference (and the absence of difference) shows when the number of rows grows. For 10000×3 data, the computation takes 3.9 seconds for both of my answers, while di.bezrukov's method takes 353 seconds.
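A sketch of how such a comparison can be reproduced for my two versions (the size, seed and function names are my own choices; absolute timings depend on the machine, so none are hardcoded here):

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=(1000, 3))  # 1000 rows to keep the run short

def pure_numpy(x):
    # builds the full (n, n, 3) boolean array in memory
    return (x.reshape(-1, 1, 3) > x.reshape(1, -1, 3)).all(axis=2).any(axis=1).sum()

def compound(x):
    # one Python loop over rows; numpy does the inner work
    return sum((r > x).all(axis=1).any() for r in x)

# Both methods must agree before timing them
assert pure_numpy(x) == compound(x)

print("pure numpy:", timeit.timeit(lambda: pure_numpy(x), number=10))
print("compound:  ", timeit.timeit(lambda: compound(x), number=10))
```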
The reasons behind these two facts:
Still, all 3 methods are O(n²), or rather O(n²×m) if we call `m` the number of columns. All have 3 nested loops. Di.bezrukov's has two explicit Python `for` loops, plus one implicit loop in the `.all` (still a for loop, even if it is done in numpy's internal code). My compound version has 1 Python for loop (the generator expression) and 2 implicit loops, `.all` and `.any`. My pure numpy version has no explicit loop, but 3 implicit nested numpy loops (in the building of the 3d array).

So the loop structure is the same. Only numpy's loops are faster.
I am prouder of my pure numpy version, because I didn't find it at first. But pragmatically, my first (compound) version is better. It is slower only when it doesn't matter (for very small arrays). It doesn't consume any extra memory. And the only loop it leaves in Python is the outer one, which is negligible next to the inner loops.
tl;dr: use the compound version, unless you really have only 4 rows and μs matter, or you are engaged in a contest of who can think in the purest numpy 3d-chess :D, in which case go for the pure numpy one.