How to efficiently store and manipulate sparse binary matrices in Octave?

1.3k Views Asked by At

I'm trying to manipulate sparse binary matrices in GNU Octave, and it's using way more memory than I expect, and relevant sparse-matrix functions don't behave the way I want them to. I see this question about higher-than-expected sparse-matrix storage in MATLAB, which suggests that this matrix should consume even more memory, but helped explain (only) part of this situation.

For a sparse, binary matrix, I can't figure out any way to get Octave to NOT STORE the array of values (they're always implicitly 1, so need not be stored). Can this be done? Octave always seems to consume memory for a values array.

A trimmed-down example demonstrating the situation: create random sparse matrix, turn it into "binary":

mys=spones(sprandn(1024,1024,.03)); nnz(mys), whos mys

Shows the situation. The consumed size is consistent with the storage mechanism outlined in aforementioned SO answer and expanded below, if spones() creates an array of storage-class double and if all indices are 32-bit (i.e.,

TotalStorageSize - rowIndices - columnIndices == NumNonZero*sizeof(double)
-- unnecessarily storing these values (all 1s as doubles) is over half of the total memory consumed by this 3%-sparse object.

After messing with this (for too long) while composing this question, I discovered some partial workarounds, so I'm going to "self-answer" (only) part of the question for continuity (hopefully), but I didn't figure out an adequate answer to main question:

How do I create an efficiently-stored ("no-/implicit-values") binary matrix in Octave?

Additional background on storage format follows...

The Octave docs say the storage format for sparse matrices uses format Compressed Sparse Column (CSC). This seems to imply storing the following arrays (expanding on aforementioned SO answer, with canonical Yale format labels and tweaks for column-major order):

  • values (A), number-of-nonzeros (NNZ) entries of storage-class size;
  • row numbers (IA), NNZ entries of index size (hopefully int64 but maybe int32);
  • start of each column (JA), number-of-columns-plus-1 entries of index size)

In this case, for binary-only storage, I hope there's a way to completely avoid storing array (A), but I can't figure it out.

1

There are 1 best solutions below

0
On

Full disclosure: As noted above, as I was composing this question, I discovered a workaround to reduce memory usage, so I'm "self-answering" part of this here, but it still isn't fully satisfying, so I'm still listening for a better actual answer to storage of a sparse binary matrix without a trivial, bloated, unnecessary values array...

To get a binary-like value out of a number-like value and reduce the memory usage in this case, use "logical" storage, created by logical(X). For example, building from above,

logicalmys = logical(mys);

creates a sparse bool matrix, that takes up less memory (1-byte logical rather than 8-byte double for the values array).

Adding more information to the whos information using whos_line_format helps illuminate the situation: The default string includes 5 of the 7 properties (see docs for more). I'm using the format string

whos_line_format(" %a:4; %ln:6; %cs:16:6:1; %rb:12; %lc:8; %e:10; %t:20;\n")

to add display of "elements", and "type" (which is distinct from "class").

With that, whos mys logicalmys shows something like

 Attr Name            Size      Bytes Class      Elements                 Type
 ==== ====            ====      ===== =====      ========                 ==== 
      mys          1024x1024   391100 double        32250        sparse matrix
      logicalmys   1024x1024   165350 logical       32250   sparse bool matrix

So this shows a distinction between sparse matrix and sparse bool matrix. However, the total memory consumed by logicalmys is consistent with actually storing an array of NNZ booleans (1-byte) -- That is:

  • totalMemory minus rowIndices minus columnOffsets leaves NNZ bytes left;

in numbers,

  • 165350 - 32250*4 - 1025*4 == 32250.

So we're still storing 32250 elements, all of which are 1. Further, if you set one of the 1-elements to zero, it reduces the reported storage! For a good time, try: pick a nonzero element, e.g., (42,1), then zero it: logicalmys(42,1) = 0; then whos it!

My hope is that this is correct, and that this clarifies some things for those who might be interested. Comments, corrections, or actual answers welcome!