How should very large but highly symmetric arrays be handled in Python?


I am trying to populate and store a NumPy array with roughly 800 billion entries, with data to be retrieved later. The array is a rank-7 tensor in 50 dimensions, i.e. each of its 7 indices runs over 50 values, so it has shape (50, 50, 50, 50, 50, 50, 50) and 50^7 = 781,250,000,000 entries. However, the data is totally symmetric, so most of the array is redundant. Concretely, an element is 0 whenever any index is repeated, and elements related by any permutation of the indices are always equal. This means there are only 50-choose-7, or roughly 100 million, unique non-zero elements.
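To make the counting concrete, a quick sanity check:

import math

full = 50 ** 7             # 781,250,000,000 entries in the dense array
unique = math.comb(50, 7)  # 99,884,400 distinct non-zero values
print(full // unique)      # roughly 7,800-fold redundancy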

The issue is that I lack the RAM to even initialize an empty array large enough to store the data. I could certainly handle an array with only 100 million entries, but I don't know how I would go about doing that.

I have a strong math background, but I am fairly inexperienced with programming. I know there are ways to use hard-disk space to supplement RAM (memory-mapped arrays, I believe), but I am confident I don't have the disk space to hold the full array either. I am sure there is probably a mathematical trick to relate the large symmetric tensor to a smaller, non-redundant one, but I can't find anything written about it. I would also guess there is a sparse data structure that works around the RAM limit by telling the program up front that I will never need most of the entries, but I don't understand most of what I find about this, and the sparse-matrix applications I do find don't seem to apply to this scenario.
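For concreteness, here is a rough sketch (with made-up names; I don't know whether this is the right approach) of the kind of thing I imagine, keeping only one canonical copy of each symmetric entry:

data = {}

def set_value(indices, value):
    # store everything under the sorted index tuple, so all
    # permutations of the same indices share a single entry
    key = tuple(sorted(indices))
    if len(set(key)) < len(key):
        raise ValueError("an entry with a repeated index is identically 0")
    data[key] = value

def get_value(indices):
    # permutations map to the same key; repeated-index lookups
    # and never-set entries fall through to 0
    return data.get(tuple(sorted(indices)), 0.0)

I would guess, though, that a plain dict holding ~100 million 7-tuples carries a lot of per-entry overhead (tens of gigabytes, if I understand correctly), so presumably there is a leaner way.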

I suspect that my inexperience is leading me to use the wrong terminology in my searches and to misunderstand the solutions that I do find.

So what is the best way to manage data of this sort in Python?


Answer by J_H:

Exploit the symmetry in this problem. Define a function

def hash(dims, indices):

which always returns a non-negative integer less than 1e8 (one slot per unique element, since 50-choose-7 is just under 10^8).

When accessing your numpy array, always do so via the hash function, so we're storing fewer than 1e8 distinct entries. At that point it likely doesn't matter whether the underlying numpy storage is sparse or dense.
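One way to realize such a function (a sketch, not the only option) is the combinatorial number system: sort the indices, then rank the resulting combination. The storage helpers below (store, set_entry, get_entry) are illustrative names, not part of the original answer:

import math
import numpy as np

DIMS, RANK = 50, 7
N_UNIQUE = math.comb(DIMS, RANK)   # 99,884,400, just under 1e8

def hash(dims, indices):
    # NOTE: this shadows Python's builtin hash(); rename it in real code.
    # Sorting makes every permutation of the same indices collide on
    # purpose; the combinatorial number system then ranks each sorted
    # tuple of distinct indices to a unique integer in [0, C(dims, 7)).
    s = sorted(indices)
    assert all(0 <= c < dims for c in s)
    if any(a == b for a, b in zip(s, s[1:])):
        return None                # repeated index: the entry is 0
    return sum(math.comb(c, i + 1) for i, c in enumerate(s))

store = np.zeros(N_UNIQUE)         # ~0.8 GB of float64, fits in RAM

def set_entry(indices, value):
    h = hash(DIMS, indices)
    if h is None:
        raise ValueError("entries with a repeated index are fixed at 0")
    store[h] = value

def get_entry(indices):
    h = hash(DIMS, indices)
    return 0.0 if h is None else store[h]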

Often we seek to minimize the number of collisions produced by a hash function. But here we do the opposite: we take advantage of the fact that permutation-equivalent coordinates hash to the same value, which is exactly what encodes the symmetry.
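Using the hypothetical set_entry/get_entry helpers from the sketch above, the intentional collisions behave like this:

set_entry((3, 14, 15, 9, 26, 5, 35), 2.718)
assert get_entry((35, 26, 15, 14, 9, 5, 3)) == 2.718   # any permutation
assert get_entry((3, 3, 15, 9, 26, 5, 35)) == 0.0      # repeated index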