Elementwise comparison between two large vectors with a high degree of sparsity

I need a function that behaves like numpy.where but does not run into the memory issues caused by the dense representation of the Boolean array. It should therefore be able to return an extremely sparse Boolean array.

While the example below works fine for small vectors, it is impossible to use numpy.where once my_sample is, for example, of shape (10,000,000, 1) and my_population is of shape (100,000, 1). As other threads explain, the broadcast comparison my_sample == my_population.T materialises a dense Boolean array of shape (10,000,000, 100,000) before numpy.where ever sees it. This dense array cannot fit into memory on my machine, or on most machines.

The resulting array is extremely sparse: in my case I know it will have at most two 1s per row! With the shapes above, the density is at most 0.002%, so the sparse representation should definitely fit into memory.
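A quick back-of-the-envelope check of the two representations (assuming NumPy's 1 byte per Boolean element, and roughly 16 bytes per stored entry for two 64-bit coordinates in COO form):

# dense Boolean array: ~1 byte per element in NumPy
dense_bytes = 10_000_000 * 100_000    # 1e12 bytes, roughly 0.9 TiB

# sparse, at most two entries per row, ~16 bytes per stored entry
sparse_bytes = 10_000_000 * 2 * 16    # 3.2e8 bytes, roughly 300 MiB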

I am trying to create something similar to a model/design matrix for a numerical simulation. The resulting matrix will be used for some linear algebra operations.

Minimal working example (note that the positions/coordinates in the vectors matter):

# import packages
import numpy as np

# my_sample is the vector of observations
my_sample = ['a', 'b', 'c', 'a']

# my_population is the lookup vector
my_population = ['a', 'b', 'c']

# initialise the matrix (dense, for this small example)
my_zero = np.zeros((len(my_sample), len(my_population)))

# reshape to arrays
my_sample = np.array(my_sample).reshape(-1, 1)
my_population = np.array(my_population).reshape(-1, 1)

# THIS STEP CAUSES THE MEMORY ISSUES
my_indices = np.where((my_sample == my_population.T))

# set the matches to equal one
my_zero[my_indices] = 1

# show matrix
my_zero
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

2 Answers

Accepted Answer

First, let's encode this as integers, not strings. Strings suck.

import scipy.sparse as sps

# return_inverse gives each value's integer code into the sorted unique levels
pop_levels, pop_idx = np.unique(my_population, return_inverse=True)
sample_levels, sample_idx = np.unique(my_sample, return_inverse=True)

It's important that pop_levels and sample_levels be identical, but if they are you're pretty much done - pack these into sparse masks:

# CSR triplet form (data, indices, indptr): exactly one entry per row,
# in the column given by that sample's integer code
sample_mask = sps.csr_matrix(
    (np.ones_like(sample_idx), sample_idx, range(len(sample_idx) + 1))
)

And we're done:

>>> sample_mask.A
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

You may need to reorder your factor levels so that they're the same between your sample and population, but as long as you can unify those labels this is very simple to do with just matrix assignment.
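If the label sets do not match exactly, one possible fix is to map the sample values onto the population's sorted levels with np.searchsorted; this is a sketch (not part of the original answer) that also drops sample values with no match in the population:

flat_sample = np.asarray(my_sample).ravel()
pop_levels = np.unique(my_population)

# position of each sample value within the sorted population levels
col = np.searchsorted(pop_levels, flat_sample)
col = np.clip(col, 0, len(pop_levels) - 1)   # keep indices in bounds
found = pop_levels[col] == flat_sample       # rows with a real match
rows = np.flatnonzero(found)

sample_mask = sps.csr_matrix(
    (np.ones(rows.size, dtype=np.uint8), (rows, col[found])),
    shape=(len(flat_sample), len(pop_levels)),
)

Rows whose sample value is absent from the population simply stay all-zero.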

Answer

A more direct route:

In [125]: my_sample = ['a', 'b', 'c', 'a']
     ...: my_population = ['a', 'b', 'c']
     ...: 
     ...: 
In [126]: np.array(my_sample)[:,None]==np.array(my_population)
Out[126]: 
array([[ True, False, False],
       [False,  True, False],
       [False, False,  True],
       [ True, False, False]])

This is a boolean dtype. If you want a 0/1 integer matrix:

In [128]: (np.array(my_sample)[:,None]==np.array(my_population)).astype(int)
Out[128]: 
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

If you have space to make my_zero, you should have space to make this array. If there is still a problem with a large temporary buffer, you could try converting to 'uint8', which takes up less space.
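For reference, a quick itemsize check (the 8-byte figure assumes the default int64 of most 64-bit platforms):

mask = np.array(my_sample)[:, None] == np.array(my_population)
mask.dtype.itemsize                    # 1 byte per element (bool)
mask.astype(int).dtype.itemsize        # typically 8 bytes (int64)
mask.astype('uint8').dtype.itemsize    # 1 byte per element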

In your version you make two large arrays, my_zero and my_sample == my_population.T. But beware that even if you make it past this step, you may not have space to do anything else with my_zero.

Creating a sparse matrix may save you space, but the sparsity has to be quite high to maintain any sort of speed. Matrix multiplication, though, is a relatively strong area for scipy.sparse matrices.
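Since the question says the matrix feeds into linear algebra, this is the kind of operation where the sparse mask does well (a sketch; sample_mask is the CSR mask from the accepted answer, and pop_values is a made-up vector with one value per population entry):

pop_values = np.arange(len(my_population), dtype=float)  # hypothetical data
# each row of the mask selects that sample's population value
projected = sample_mask @ pop_values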

time tests

In [134]: %%timeit
     ...: pop_levels, pop_idx = np.unique(my_population, return_inverse=True)
     ...: sample_levels, sample_idx = np.unique(my_sample, return_inverse=True)
     ...: sample_mask = sparse.csr_matrix((np.ones_like(sample_idx), sample_idx, range(len(sample_idx) + 1)))

247 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [135]: timeit (np.array(my_sample)[:,None]==np.array(my_population)).astype(int)
9.61 µs ± 9.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And make a sparse matrix from the dense:

In [136]: timeit sparse.csr_matrix((np.array(my_sample)[:,None]==np.array(my_population)).astype(int))
332 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Scaling for large arrays could be quite different.
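For a rough sense of scale, here is a sketch at the question's sizes, with random integer codes standing in for the real encoding (not a benchmark):

n_sample, n_pop = 10_000_000, 100_000
rng = np.random.default_rng(0)
sample_idx = rng.integers(0, n_pop, size=n_sample)   # stand-in codes

sample_mask = sparse.csr_matrix(
    (np.ones_like(sample_idx), sample_idx, np.arange(n_sample + 1)),
    shape=(n_sample, n_pop),
)
# ~10M stored values: data + indices + indptr come to a few hundred MB,
# versus ~1 TB for the dense (10,000,000 x 100,000) comparison array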