Efficient way to check whether high-dimensional sub-arrays overlap between two ndarrays in Python


For example, I have two ndarrays: the shape of train_dataset is (10000, 28, 28) and the shape of val_dataset is (2000, 28, 28).

Other than iterating, is there an efficient way to use NumPy array functions to find the overlap between the two ndarrays?

5 Answers

Answer (score 0)

Memory permitting, you could use broadcasting, like so -

val_dataset[(train_dataset[:,None] == val_dataset).all(axis=(2,3)).any(0)]

Sample run -

In [55]: train_dataset
Out[55]: 
array([[[1, 1],
        [1, 1]],

       [[1, 0],
        [0, 0]],

       [[0, 0],
        [0, 1]],

       [[0, 1],
        [0, 0]],

       [[1, 1],
        [1, 0]]])

In [56]: val_dataset
Out[56]: 
array([[[0, 1],
        [1, 0]],

       [[1, 1],
        [1, 1]],

       [[0, 0],
        [0, 1]]])

In [57]: val_dataset[(train_dataset[:,None] == val_dataset).all(axis=(2,3)).any(0)]
Out[57]: 
array([[[1, 1],
        [1, 1]],

       [[0, 0],
        [0, 1]]])

If the elements are integers, you could collapse every block along axes (1, 2) of the input arrays into a single scalar, treating the block values as the digits of a linear index, and then efficiently use np.in1d or np.intersect1d to find the matches.
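For instance, here is a minimal sketch of that idea with np.ravel_multi_index on the small 2x2 arrays of 0/1 values from the sample run above, where each block's linear index fits comfortably in an integer; for 28x28 blocks with a wide value range the combined index would overflow a fixed-width integer, so the collapsing step would need more care there.

import numpy as np

train_dataset = np.array([[[1, 1], [1, 1]],
                          [[1, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 1], [0, 0]],
                          [[1, 1], [1, 0]]])
val_dataset = np.array([[[0, 1], [1, 0]],
                        [[1, 1], [1, 1]],
                        [[0, 0], [0, 1]]])

# treat each flattened 2x2 block of 0/1 values as coordinates into a
# (2, 2, 2, 2) grid and collapse them into single linear indices
dims = (2,) * train_dataset[0].size
train_ids = np.ravel_multi_index(train_dataset.reshape(len(train_dataset), -1).T, dims)
val_ids = np.ravel_multi_index(val_dataset.reshape(len(val_dataset), -1).T, dims)

# rows of val_dataset that also appear somewhere in train_dataset
print(val_dataset[np.in1d(val_ids, train_ids)])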

Answer (score 2)

Full broadcasting here generates a 10000*2000*28*28 ≈ 15.7 GB boolean array.

For efficiency, you can:

  • pack the data, so that the broadcast comparison only produces a 10000 × 2000 (≈ 20 MB) boolean array:

    from pylab import *
    N = 10000
    a = rand(N, 28, 28)
    b = a[randint(0, N, N//5)]

    packedtype = 'S' + str(a.size//a.shape[0]*a.dtype.itemsize)  # 'S6272'
    ma = frombuffer(a, packedtype)   # ma.shape = (10000,)
    mb = frombuffer(b, packedtype)   # mb.shape = (2000,)

    %timeit a[:,None] == b      # 102 s
    %timeit ma[:,None] == mb    # 800 ms
    allclose((a[:,None] == b).all((2,3)), ma[:,None] == mb)   # True
    

    Besides using less memory, the packed comparison also benefits from lazy string comparison, which stops at the first differing byte:

    In [31]: %timeit a[:100]==b[:100]
    10000 loops, best of 3: 175 µs per loop
    
    In [32]: %timeit a[:100]==a[:100]
    10000 loops, best of 3: 133 µs per loop
    
    In [34]: %timeit ma[:100]==mb[:100]
    100000 loops, best of 3: 7.55 µs per loop
    
    In [35]: %timeit ma[:100]==ma[:100]
    10000 loops, best of 3: 156 µs per loop
    

The matching indices are then given by (ma[:,None] == mb).nonzero().

  • use in1d, for an (Na+Nb)·ln(Na+Nb) complexity instead of the Na·Nb full comparison:

    %timeit in1d(ma, mb).nonzero()   # 590 ms
    

Not a big gain here, but asymptotically better.
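Putting the two steps together, a self-contained sketch of the pack-then-in1d approach (same array names as above, extracting the rows of b that also occur in a) might look like this:

import numpy as np

rng = np.random.RandomState(0)
N = 10000
a = rng.rand(N, 28, 28)
b = a[rng.randint(0, N, N//5)]       # 2000 rows drawn from a

# view each (28, 28) block as a single fixed-length byte string
packedtype = 'S' + str(a[0].size * a.dtype.itemsize)        # 'S6272'
ma = np.frombuffer(np.ascontiguousarray(a), packedtype)     # shape (10000,)
mb = np.frombuffer(np.ascontiguousarray(b), packedtype)     # shape (2000,)

overlap = b[np.in1d(mb, ma)]   # rows of b that also occur in a
print(overlap.shape)           # (2000, 28, 28) here, since b was drawn from a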

Answer (score 3)

One trick I learnt from Jaime's excellent answer here is to use an np.void dtype in order to view each row in the input arrays as a single element. This allows you to treat them as 1D arrays, which can then be passed to np.in1d or one of the other set routines.

import numpy as np

def find_overlap(A, B):

    if not A.dtype == B.dtype:
        raise TypeError("A and B must have the same dtype")
    if not A.shape[1:] == B.shape[1:]:
        raise ValueError("the shapes of A and B must be identical apart from "
                         "the row dimension")

    # reshape A and B to 2D arrays. force a copy if necessary in order to
    # ensure that they are C-contiguous.
    A = np.ascontiguousarray(A.reshape(A.shape[0], -1))
    B = np.ascontiguousarray(B.reshape(B.shape[0], -1))

    # void type that views each row in A and B as a single item
    t = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))

    # use in1d to find rows in A that are also in B
    return np.in1d(A.view(t), B.view(t))

For example:

gen = np.random.RandomState(0)

A = gen.randn(1000, 28, 28)
dupe_idx = gen.choice(A.shape[0], size=200, replace=False)
B = A[dupe_idx]

A_in_B = find_overlap(A, B)

print(np.all(np.where(A_in_B)[0] == np.sort(dupe_idx)))
# True

This method is much more memory-efficient than Divakar's, since it doesn't require broadcasting out to an (m, n, ...) boolean array. In fact, if A and B are row-major then no copying is required at all.
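As a quick sanity check of that no-copy claim, one can verify (assuming a NumPy version that provides np.shares_memory) that the void view produced inside find_overlap still points at the original array's memory when the input is C-contiguous:

import numpy as np

A = np.random.randn(1000, 28, 28)                        # C-contiguous by default
A2 = np.ascontiguousarray(A.reshape(A.shape[0], -1))     # no copy for C-contiguous input
t = np.dtype((np.void, A2.dtype.itemsize * A2.shape[1]))
print(np.shares_memory(A, A2.view(t)))                   # True -> the view is free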


For comparison, I've slightly adapted Divakar's and B. M.'s solutions.

def divakar(A, B):
    A.shape = A.shape[0], -1
    B.shape = B.shape[0], -1
    return (B[:,None] == A).all(axis=(2)).any(0)

def bm(A, B):
    t = 'S' + str(A.size // A.shape[0] * A.dtype.itemsize)
    ma = np.frombuffer(np.ascontiguousarray(A), t)
    mb = np.frombuffer(np.ascontiguousarray(B), t)
    return (mb[:, None] == ma).any(0)

Benchmarks:

In [1]: na = 1000; nb = 200; rowshape = 28, 28

In [2]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
divakar(A, B)
   ....: 
1 loops, best of 3: 244 ms per loop

In [3]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
bm(A, B)
   ....: 
100 loops, best of 3: 2.81 ms per loop

In [4]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
find_overlap(A, B)
   ....: 
100 loops, best of 3: 15 ms per loop

As you can see, B. M.'s solution is slightly faster than mine for small n, but np.in1d scales better than testing equality for all elements (O(n log n) rather than O(n²) complexity).

In [5]: na = 10000; nb = 2000; rowshape = 28, 28

In [6]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
bm(A, B)
   ....: 
1 loops, best of 3: 271 ms per loop

In [7]: %%timeit A = gen.randn(na, *rowshape); idx = gen.choice(na, size=nb, replace=False); B = A[idx]
find_overlap(A, B)
   ....: 
10 loops, best of 3: 123 ms per loop

Divakar's solution is intractable on my laptop for arrays of this size, since it requires generating a 15GB intermediate array whereas I only have 8GB RAM.

Answer (score 2)

Solution

def overlap(a, b):
    """
    returns a boolean index array for input array b representing
    sub-arrays in b that are also found in a
    """
    aa = a.repeat(b.shape[0], axis=0)       # a0,a0,a0, a1,a1,a1, ...
    bb = tile(b, (a.shape[0], 1, 1))        # b0,b1,b2, b0,b1,b2, ...
    c = (aa == bb).all(axis=(1, 2))         # full sub-array equality for every pair
    return c.reshape(a.shape[0], b.shape[0]).any(axis=0)

You can use the returned index array to index b to extract the elements which are also found in a

b[overlap(a,b)]

Explanation

For simplicity's sake I assume you have imported everything from numpy for this example:

from numpy import *

So, for example, given two ndarrays

a = arange(4*2*2).reshape(4,2,2)
b = arange(3*2*2).reshape(3,2,2)

we repeat the sub-arrays of a and tile b so that every sub-array of a is lined up against every sub-array of b

aa = a.repeat(b.shape[0], axis=0)   # a0,a0,a0, a1,a1,a1, a2,a2,a2, a3,a3,a3
bb = tile(b, (a.shape[0], 1, 1))    # b0,b1,b2, b0,b1,b2, b0,b1,b2, b0,b1,b2

we can then simply compare the elements of aa and bb

c = aa == bb

A pair of sub-arrays matches only if all of its elements are equal, so we reduce c over the last two axes and arrange the result as an a.shape[0] by b.shape[0] table of pairwise matches

cc = c.all(axis=(1,2)).reshape(a.shape[0], b.shape[0])

Finally, a sub-array of b is found in a if it matches any of the sub-arrays of a, which gives the boolean index array

cc.any(axis=0)

In our example we get

array([True,  True,  True], dtype=bool)

To check, change the first element of b

b[0] = array([[50,60],[70,80]])

and we get

array([False,  True,  True], dtype=bool)

Answer (score 0)

Does this question come from Google's online deep learning course? The following is my solution:

n_overlap = 0  # number of overlapping rows
for i in range(val_dataset.shape[0]):  # iterate over all rows of val_dataset
    overlap = (train_dataset == val_dataset[i, :, :]).all(axis=1).all(axis=1).sum()
    if overlap:
        n_overlap += 1
print(n_overlap)

Broadcasting handles the comparison against the whole of train_dataset inside the loop, so only one explicit Python loop remains. You may want to test the performance difference yourself.
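Memory permitting, the remaining Python loop can also be removed by broadcasting over both datasets at once, as in the first answer; a minimal sketch of the fully vectorized count (assuming train_dataset and val_dataset are as described in the question):

# compare every row of val_dataset against every row of train_dataset at once;
# this builds an (n_train, n_val) table of full-row matches
matches = (train_dataset[:, None] == val_dataset).all(axis=(2, 3))
n_overlap = matches.any(axis=0).sum()   # rows of val_dataset found in train_dataset
print(n_overlap)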