Large multidimensional array runs out of memory, hoping to solve it with an h5py dataset


Here is a simplified version of what I intend to do:

import numpy as np

loop1 = range(10)
loop2 = range(10)
loop3 = range(100)

params = []                      # all [l, n, m] index combinations
for l in loop1:
    for n in loop2:
        for m in loop3:
            params.append([l, n, m])

dSet = []
for l in params:
    matrix = np.ones((600, 600))
    matrix = l[2] * matrix
    dSet.append(matrix)          # 10,000 matrices of 600x600 floats kept in memory

Since there will be 10 thousand 600×600 matrices, dSet cannot hold that much data in memory and the program runs out of memory every time. I would therefore like to use h5py (HDF5) to store dSet and flush it to disk every 100 loop iterations. Is there a decent solution for this?

Thanks so much

1 Answer


Sure, you can do this, but it depends on what you want to do:

Do you want to store each of the 10 thousand 600×600 matrices in its own dataset, or do you want one huge matrix (6,000,000×600)?

In the first case you create its own dataset for each matrix with dset = f.create_dataset("init", data=myData).
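For example, a minimal sketch of that approach could look like the following (it assumes the params list from your question and a placeholder file name; only one 600×600 matrix lives in memory at a time):

import h5py
import numpy as np

with h5py.File("matrices.h5", "w") as f:                    # placeholder file name
    for idx, l in enumerate(params):
        matrix = l[2] * np.ones((600, 600))                 # build one matrix at a time
        f.create_dataset("matrix_%d" % idx, data=matrix)    # one dataset per matrix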

In the second case you have to loop over the data and write it in slices after you have created the dataset. Something along these lines:

dset = f.create_dataset("MyDataset", (6000000, 600), dtype='f')

for idx, l in enumerate(params):
    start = idx * 600
    matrix = np.ones((600, 600))
    matrix = l[2] * matrix
    dset[start:start + 600] = matrix
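Put together with the file handling and the periodic flush you asked about, a sketch might look like this (the file name and the 100-iteration flush interval are arbitrary choices; h5py buffers writes and h5py.File.flush() pushes them to disk):

import h5py
import numpy as np

with h5py.File("result.h5", "w") as f:                      # placeholder file name
    dset = f.create_dataset("MyDataset", (6000000, 600), dtype='f')
    for idx, l in enumerate(params):
        start = idx * 600
        dset[start:start + 600] = l[2] * np.ones((600, 600))
        if (idx + 1) % 100 == 0:
            f.flush()                                       # push buffered writes to disk every 100 matrices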

This only works if you know the total size in advance. If you don't, you can use resizable (extendable) datasets; see the h5py documentation on resizable datasets for more details.
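A minimal sketch of that resizable variant, under the same assumptions as above (placeholder names, params from the question):

import h5py
import numpy as np

with h5py.File("growing.h5", "w") as f:
    # maxshape=(None, 600) lets the first axis grow without a preset limit
    dset = f.create_dataset("MyDataset", (0, 600), maxshape=(None, 600), dtype='f')
    for idx, l in enumerate(params):
        start = idx * 600
        dset.resize((start + 600, 600))                     # grow by one 600-row block
        dset[start:start + 600] = l[2] * np.ones((600, 600))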