I need a function that behaves like numpy.where but doesn't run into the memory issues caused by the dense representation of the Boolean array. The function should therefore be able to return an extremely sparse Boolean array.
While the example presented below works fine for small data sets/vectors, it is impossible to use numpy.where once my_sample is, for example, of shape (10.000.000, 1) and my_population is of shape (100.000, 1). Having read other threads, a dense Boolean array of shape (10.000.000, 100.000) is apparently created when evaluating the expression numpy.where((my_sample == my_population.T)). This dense (10.000.000, 100.000) array cannot fit into memory on my machine/most machines.
The resulting array is extremely sparse. In my case, I know it will have at most two 1s per row! Using the specifications from above, the sparsity equals 0.002%. This should definitely fit into memory.
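As a rough back-of-the-envelope check of the numbers above, the sizes can be worked out directly. This is only a sketch: the sparse estimate assumes a CSR layout with 8-byte float values and 4-byte integer indices, which is an assumption and not something fixed by the question.

```python
# Shapes taken from the question.
n_sample = 10_000_000
n_population = 100_000
max_matches_per_row = 2

# A dense Boolean array uses one byte per element.
dense_bytes = n_sample * n_population

# Upper bound on the number of stored entries in a sparse result.
nnz = n_sample * max_matches_per_row
density_percent = 100 * nnz / (n_sample * n_population)

# Assumed CSR layout: float64 data + int32 column index per entry,
# plus one int32 row pointer per row (and one extra).
csr_bytes = nnz * (8 + 4) + (n_sample + 1) * 4

print(f"dense:  {dense_bytes / 1e12:.1f} TB")
print(f"sparse: {csr_bytes / 1e9:.2f} GB at {density_percent:.3f}% density")
```

Under those assumptions the dense array needs about 1 TB, while the sparse result stays well below 1 GB.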
I am trying to create something similar to a model/design matrix for a numerical simulation. The resulting matrix will be used for some linear algebra operations.
Minimal working example: Please note that the positions/coordinates in the vectors are of importance.
# import packages
import numpy as np
# my_sample is the vector of observations
my_sample = ['a', 'b', 'c', 'a']
# my_population is the lookup vector
my_population = ['a', 'b', 'c']
# initialise the matrix (dense matrix for this example)
my_zero = np.zeros((len(my_sample), len(my_population)))
# reshape to arrays
my_sample = np.array(my_sample).reshape(-1, 1)
my_population = np.array((my_population)).reshape(-1, 1)
# THIS STEP CAUSES THE MEMORY ISSUES
my_indices = np.where((my_sample == my_population.T))
# set the matches to equal one
my_zero[my_indices] = 1
# show matrix
my_zero
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
First, let's encode this as integers, not strings. Strings suck.
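One possible way to do the integer encoding is sketched below with np.unique, which returns the sorted distinct labels together with an integer code per element. The names sample_codes, pop_codes and levels_match are made up for this illustration; other tools (a pandas factorize, for instance) would work just as well.

```python
import numpy as np

my_sample = np.array(['a', 'b', 'c', 'a'])
my_population = np.array(['a', 'b', 'c'])

# With return_inverse=True, np.unique gives the sorted distinct labels
# (the "levels") and the integer code of every element.
sample_levels, sample_codes = np.unique(my_sample, return_inverse=True)
pop_levels, pop_codes = np.unique(my_population, return_inverse=True)

# The codes are only comparable if both vectors share the same levels.
levels_match = np.array_equal(sample_levels, pop_levels)

print(sample_codes, pop_codes, levels_match)
```

Because np.unique sorts the labels, the two level arrays come out identical whenever both vectors contain the same set of values.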
It's important that pop_levels and sample_levels be identical, but if they are you're pretty much done - pack these into sparse masks and that's it.
You may need to reorder your factor levels so that they're the same between your sample and population, but as long as you can unify those labels this is very simple to do with just matrix assignment.
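Here is a minimal sketch of how the sparse-mask idea could look with scipy.sparse. The function name sparse_match_matrix is hypothetical, and encoding both vectors against one shared set of levels is my assumption for unifying the labels; treat it as an illustration rather than a definitive implementation.

```python
import numpy as np
from scipy import sparse


def sparse_match_matrix(sample, population):
    """Sparse counterpart of (sample[:, None] == population[None, :])."""
    sample = np.asarray(sample).ravel()
    population = np.asarray(population).ravel()

    # Encode both vectors against one shared, sorted set of levels so
    # that equal labels always receive equal integer codes.
    levels, codes = np.unique(
        np.concatenate([sample, population]), return_inverse=True
    )
    codes = codes.ravel()
    sample_codes = codes[:sample.size]
    pop_codes = codes[sample.size:]

    # One-hot masks: row i has a single 1 in the column of its level.
    sample_mask = sparse.csr_matrix(
        (np.ones(sample.size), (np.arange(sample.size), sample_codes)),
        shape=(sample.size, levels.size),
    )
    pop_mask = sparse.csr_matrix(
        (np.ones(population.size), (np.arange(population.size), pop_codes)),
        shape=(population.size, levels.size),
    )

    # Entry (i, j) is 1 exactly when sample[i] and population[j] share a level.
    return (sample_mask @ pop_mask.T).tocsr()


result = sparse_match_matrix(['a', 'b', 'c', 'a'], ['a', 'b', 'c'])
print(result.toarray())
```

The dense sample-by-population comparison is never formed here; only the two one-hot masks and their sparse product are stored. Repeated values in the population simply produce several 1s in the same row.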