What's the best way to store a two-dimensional sparse array (2D sparse matrix)? How much space will it take in VoltDB?

Question 1: Are there specialized databases for storing dense and sparse matrices? I googled but didn't find any...

The matrix in question is huge (10^5 by 10^5) but it's sparse, which means that most of its values are zeros and I only need to store the non-zero values. So I thought of making a table like this:

   2D Matrix
---------------
  X   Y   val
---------------
  1   2    4.2
  5   1    91.0
  9   3    139.1

And so on: three columns, two for the coordinates and a third for the value of that cell in the sparse matrix. Question 2: Is this the best way to store a sparse matrix? I also considered MongoDB, but it seems that making one document per cell of the matrix would be too much overhead. Table-oriented databases are slow, but I can use VoltDB :)

Side-note: I thought of a Redis Hash but couldn't make it bi-dimensional (I found a way to serialize 2D matrices into 1D, so that I can store them in a Redis Hash or even a List).
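For the side-note, here is a minimal sketch of that serialization idea in Python, assuming the redis-py client; the "x:y" field-naming scheme is just one illustrative choice:

    # Flatten (x, y) into a single Redis hash field so the 2D matrix
    # becomes 1D. Assumes redis-py; the "x:y" scheme is illustrative.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def set_cell(x, y, val):
        # Only non-zero cells are ever written to the hash.
        r.hset("matrix", f"{x}:{y}", val)

    def get_cell(x, y):
        raw = r.hget("matrix", f"{x}:{y}")
        return float(raw) if raw is not None else 0.0  # missing field = zero cell

    set_cell(5, 1, 91.0)
    print(get_cell(5, 1))  # 91.0
    print(get_cell(0, 0))  # 0.0 (never stored)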

Question 3: How many bytes per row will VoltDB use? The coordinates will be integers ranging from 0 to 10^5, maybe more; the cell values will be floats.

There are 2 answers below.

Answer 1:

Regarding Question 3: based on your example, the X and Y columns could use the INTEGER datatype in VoltDB, which is 4 bytes. The value column could use the FLOAT datatype, which is 8 bytes.

Each record would therefore be 16 bytes, so the nominal size in memory would be 16 bytes * row count. In general, you add 30% for overhead, and then 1GB per server for heap size to determine the overall memory needed. See the references below for more detail.
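As a quick sanity check, the arithmetic works out like this (a sketch; the 10-million row count is only an assumed example):

    # 16 bytes per row, +30% overhead, +1 GB heap per server,
    # per the estimate above. The row count is an assumed example.
    row_bytes = 4 + 4 + 8            # INTEGER x, INTEGER y, FLOAT val
    rows = 10_000_000                # assumed non-zero cell count
    table_bytes = row_bytes * rows
    total_bytes = table_bytes * 1.3 + 1024**3   # +30% overhead, +1 GB heap
    print(f"table: {table_bytes / 1024**2:.0f} MB")            # ~153 MB
    print(f"estimated total: {total_bytes / 1024**3:.2f} GB")  # ~1.19 GB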

You will probably want to index this table, so assuming you wanted a compound index of (x,y), the size would be as follows:

Tree index: (sum-of-column-sizes + 8 + 32) * rowcount

Hash index: (((2 * rowcount) + 1) * 8) + ((sum-of-column-sizes + 32) * rowcount)

sum-of-column-sizes for (x,y) is 8 bytes.
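Written out in Python, those formulas come to (the row count is an assumed example):

    # The tree and hash index formulas above, written out.
    def tree_index_bytes(rowcount, sum_of_column_sizes=8):
        return (sum_of_column_sizes + 8 + 32) * rowcount

    def hash_index_bytes(rowcount, sum_of_column_sizes=8):
        return ((2 * rowcount) + 1) * 8 + (sum_of_column_sizes + 32) * rowcount

    rows = 10_000_000  # assumed example row count
    print(tree_index_bytes(rows) / 1024**2)  # ~458 MB
    print(hash_index_bytes(rows) / 1024**2)  # ~534 MB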

References:

The available datatypes are listed in Appendix A of Using VoltDB: http://community.voltdb.com/docs/UsingVoltDB/ddlref_createtable#TabDatatypes

Guidelines and formulas for estimating the memory size are in the VoltDB Planning Guide: http://community.voltdb.com/docs/PlanningGuide/ChapMemoryRecs

Answer 2:

The two most relevant questions are 1) how sparse? and 2) how do you want to use the data?

First, I'm assuming you want to read/write/process the data from within the database. If not, then there are many sparse matrix encodings that could even be packed into a blob and optionally compressed.
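For example, a CSR (compressed sparse row) encoding packed into a compressed blob might look like this; purely a sketch, assuming scipy and numpy are available on the client side:

    # Pack a CSR encoding into a zlib-compressed blob. CSR is just
    # three flat arrays: values, column indices, and row pointers.
    import zlib
    from scipy.sparse import csr_matrix

    vals = [4.2, 91.0, 139.1]
    rows, cols = [1, 5, 9], [2, 1, 3]
    m = csr_matrix((vals, (rows, cols)), shape=(10, 10))

    # (Real code would also store the shape and array lengths to decode.)
    blob = zlib.compress(m.data.tobytes() + m.indices.tobytes() + m.indptr.tobytes())
    print(len(blob), "bytes compressed")  # store this in a BLOB column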

Assuming your data is fairly sparse, and assuming you want to use the data within the database, the x,y,value tuple storage is probably best. Chapter 4 of the VoltDB Planning Guide covers estimating memory usage and sizing your hardware.

http://community.voltdb.com/docs/PlanningGuide/ChapMemoryRecs

The short answer is that tables with numeric data are packed pretty tight. You've got 12 bytes of real data per row (short, short, double). You should see an average of just over 1 byte beyond that in overhead per row. You'll also need to add in the size of any indexes. The documentation describes the worst case. I think for an index on two short integers such as the X and Y columns, the storage per key will be close to 28 bytes, including overhead.
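Following this answer's figures (note it assumes 2-byte short coordinates; coordinates up to 10^5 would actually need the 4-byte INTEGER, as in the other answer), a rough total looks like:

    # 12 bytes of data (short, short, double), ~1 byte of row overhead,
    # ~28 bytes per (x, y) index key. The row count is an assumed example.
    rows = 10_000_000
    per_row = 12 + 1 + 28
    print(f"~{per_row * rows / 1024**2:.0f} MB for {rows:,} non-zero cells")  # ~391 MB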