Memory usage of sparse matrices (scipy)


I was expecting scipy's sparse matrices to use a lot less memory than the na(t)ive list of lists representation, but experiments have proven me wrong. In the snippet below, I'm building a random binary matrix with around 75% zeroes, and then comparing the memory usage with each available representation in scipy.sparse (simplified IPython session):

# %load_ext memory_profiler, from scipy import sparse, etc.
# ...
%memit M = random_binary_matrix(5000, 5000)  # contains ints
peak memory: 250.36 MiB, increment: 191.77 MiB

In : sum(line.count(0) for line in M) / (len(M) * len(M[0]))
Out: 0.75004468

%memit X_1 = sparse.bsr_matrix(M)
peak memory: 640.49 MiB, increment: 353.76 MiB
%memit X_2 = sparse.coo_matrix(M)
peak memory: 640.71 MiB, increment: 286.09 MiB
%memit X_3 = sparse.csc_matrix(M)
peak memory: 807.51 MiB, increment: 357.53 MiB
%memit X_4 = sparse.csr_matrix(M)
peak memory: 840.04 MiB, increment: 270.91 MiB
%memit X_5 = sparse.dia_matrix(M)
peak memory: 1075.20 MiB, increment: 386.87 MiB
%memit X_6 = sparse.dok_matrix(M)
peak memory: 3059.86 MiB, increment: 1990.62 MiB
%memit X_7 = sparse.lil_matrix(M)
peak memory: 2774.67 MiB, increment: 385.39 MiB
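(For reference, random_binary_matrix isn't shown here; something in this spirit, returning a native list of lists of ints with ~25% ones, is assumed:)

import random

def random_binary_matrix(n, m, density=0.25):
    # hypothetical sketch: a plain list of lists of ints,
    # with roughly `density` ones and the rest zeros
    return [[1 if random.random() < density else 0 for _ in range(m)]
            for _ in range(n)]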

Am I doing something wrong? Am I missing something (including the point of these alternative representations)?

... or is memory_profiler, or my lack of comprehension thereof, to blame? In particular, the relationship between "peak memory" and "increment" seems dubious at times: initialising X_2 supposedly increments memory usage by 286.09 MiB, yet the peak memory usage is barely above what it was prior to executing that line.

If it matters: I'm running Debian 12, Python 3.11.2, IPython 8.5.0, scipy 1.10.1, and memory_profiler 0.61.0.


There are 2 best solutions below

Answer by Homer512:

Creating a sparse matrix from a dense matrix results in a fully populated sparse matrix (including explicit zeros). This is because scipy doesn't know the tolerance you want to use for excluding values close to zero. Arguably, scipy should come with a factory method that does this for you, but we can always build one manually.

The sparse matrix classes accept many different combinations of constructor arguments. It seems you want to exclude only exact zeros; for that, something like the following should work:

import numpy as np

M = np.asarray(M)       # nonzero() and fancy indexing need an ndarray, not a list of lists
nonzeros = M.nonzero()  # tuple (row_ind, col_ind) of the nonzero coordinates
# argument pattern: csr_matrix((data, (row_ind, col_ind)), shape=(M, N))
X_4 = sparse.csr_matrix((M[nonzeros], nonzeros), M.shape)
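You can check that only the nonzero entries were stored (nnz and data are standard attributes of the scipy sparse formats):

X_4.nnz          # number of explicitly stored entries
X_4.data.nbytes  # bytes taken by the value array alone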

If you want a tolerance around zero instead of exact zeros, you could use this:

tolerance = 0.1
nonzeros = (abs(M) > tolerance).nonzero()  # coordinates of entries above the tolerance
X_4 = sparse.csr_matrix((M[nonzeros], nonzeros), M.shape)

You can also create a fully populated sparse matrix first and then call eliminate_zeros() on it, but this results in temporarily higher memory usage.
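A minimal sketch of that alternative (eliminate_zeros() is available on the csr, csc and bsr formats and drops explicitly stored zeros in place):

X = sparse.csr_matrix(M)  # may still carry explicitly stored zeros
X.eliminate_zeros()       # drop them in place; the memory peak happened above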

Answer by hpaulj:

Leaving memit aside and working at a much smaller dimension, here's a coo matrix:

In [60]: M = sparse.random(10,10, .25, 'coo').astype(bool).astype(int)
In [61]: M
Out[61]: 
<10x10 sparse matrix of type '<class 'numpy.int64'>'
    with 25 stored elements in COOrdinate format>

Its dense equivalent:

In [62]: M.A
Out[62]: 
array([[0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [63]: np.count_nonzero(M.A)
Out[63]: 25
In [64]: M.A.nbytes
Out[64]: 800
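That is 10 × 10 int64 elements at 8 bytes each: 800 bytes for the dense array.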

Memory use of the main arrays - with data and coordinates:

In [65]: M.data.nbytes+M.row.nbytes+M.col.nbytes
Out[65]: 400

Converting to csr can reduce memory use, since it replaces M.row (one entry per stored value) with indptr (one entry per row, plus one). I say can, because the saving depends on the shape: it helps only when there are more stored values than rows. The same goes for csc, with columns.

In [66]: Mr=M.tocsr()
In [67]: Mr.data.nbytes+Mr.indptr.nbytes+Mr.indices.nbytes
Out[67]: 344
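The numbers line up: data is int64 (25 × 8 = 200 bytes) in both formats, while the int32 row array (25 × 4 = 100 bytes) is replaced by indptr (11 × 4 = 44 bytes), taking the total from 400 to 344.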

For a 'random' matrix, bsr does not help:

In [68]: Mb = M.tobsr()
In [69]: Mb
Out[69]: 
<10x10 sparse matrix of type '<class 'numpy.int64'>'
    with 25 stored elements (blocksize = 1x1) in Block Sparse Row format>

Memory use of a lil matrix is harder to estimate, since it stores the information in object-dtype arrays holding per-row lists:

In [70]: Mi = M.tolil()
In [72]: Mi.data, Mi.rows
Out[72]: 
(array([list([1, 1]), list([1, 1]), list([1, 1]), list([1]),
        list([1, 1, 1, 1, 1, 1]), list([1, 1]), list([1, 1, 1, 1]),
        list([1, 1]), list([1, 1, 1, 1]), list([])], dtype=object),
 array([list([1, 7]), list([8, 9]), list([3, 8]), list([8]),
        list([0, 2, 3, 4, 5, 9]), list([0, 2]), list([1, 2, 3, 5]),
        list([0, 6]), list([0, 3, 6, 7]), list([])], dtype=object))
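A rough lower bound, as an approximation that counts the object arrays plus the list objects themselves (sys.getsizeof does not recurse into the stored Python ints):

import sys

lil_bytes = (Mi.data.nbytes + Mi.rows.nbytes  # pointer storage in the object arrays
             + sum(sys.getsizeof(lst) for lst in Mi.data)
             + sum(sys.getsizeof(lst) for lst in Mi.rows))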

dok is a dict, keyed by (row, column) tuples:

In [73]: Md =M.todok()
In [74]: Md.keys()
Out[74]: dict_keys([(0, 1), (0, 7), (1, 8), (1, 9), (2, 3), (2, 8), (3, 8), (4, 0), (4, 2), (4, 3), (4, 4), (4, 5), (4, 9), (5, 0), (5, 2), (6, 1), (6, 2), (6, 3), (6, 5), (7, 0), (7, 6), (8, 0), (8, 3), (8, 6), (8, 7)])
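So a rough size estimate can lean on the dict machinery; this is an approximation that counts the hash table plus the key tuples and the stored values:

import sys

dok_bytes = (sys.getsizeof(dict(Md))  # the hash table itself
             + sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in Md.items()))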

None of this considers memory use during the conversions. Going from dense to coo requires a nonzero call to get the coordinates of all the nonzero values. Making a coo from data/row/col inputs may not add any memory if those arrays already have the correct types. Making a csr from a coo may require a lexicographic sort of the coordinates. Most math is done in the csr format, so there can be a lot of hidden conversion between formats. lil and dok are good for incremental input, but are otherwise slow and more memory intensive.
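For instance, building a coo directly from prepared arrays (a sketch with illustrative values; the index dtype matches what scipy uses internally):

import numpy as np
from scipy import sparse

row = np.array([0, 1, 2], dtype=np.int32)   # coordinates, already in index dtype
col = np.array([1, 0, 2], dtype=np.int32)
data = np.ones(3, dtype=np.int64)           # stored values
Mc = sparse.coo_matrix((data, (row, col)), shape=(10, 10))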