I have a very large covariance matrix (480,000 x 480,000) stored on disk in a binary format. I want to compute a corresponding whitening matrix, for which I need the SVD of the covariance matrix. I'm working on a very powerful system (40 cores, 360 GB of RAM), but that still isn't enough to do the computation in memory: in float64 the matrix alone is 480,000 x 480,000 x 8 bytes ≈ 1.8 TB, and the U and V factors would each be just as large. I'm trying to use a combination of numpy.memmap and dask to tackle this problem, since from what I understand dask is able to read chunks of data, process them, and even save results chunk-wise. However, I run into a MemoryError whenever I try to work with the full covariance matrix, and I don't understand what I'm doing wrong.
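For context, the whitening step I want to run afterwards looks roughly like this (a toy in-memory sketch with a small made-up covariance matrix; the sizes and variable names are just illustrative):

import numpy as np

# Small stand-in covariance matrix instead of the real 480,000 x 480,000 one.
rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, 50))
cov = np.cov(samples, rowvar=False)

# For a symmetric PSD covariance matrix C = U diag(s) U^T, a (ZCA) whitening
# matrix is W = U diag(1 / sqrt(s)) U^T.
u, s, _ = np.linalg.svd(cov)
whitening = u @ np.diag(1.0 / np.sqrt(s)) @ u.T

# Sanity check: W C W^T should be (numerically) the identity.
assert np.allclose(whitening @ cov @ whitening.T, np.eye(cov.shape[0]))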
This is a simplified version of the code I want to use:
import numpy as np
import dask
import dask.array as da
from dask.distributed import LocalCluster, Client
client = Client(processes=False, n_workers=10, threads_per_worker=1, memory_limit="30GiB")
da_memmap = da.from_array(
    np.memmap(
        "path_to_covariance_matrix.npy",
        mode="r",
        shape=(480000, 480000),
        dtype=np.float64,
    ),
    chunks=(250, 480000),
)
At this point da_memmap is a dask array of shape (480000, 480000) and dtype float64, split into 1920 chunks of shape (250, 480000).
u, s, v = da.linalg.svd(da_memmap)
u_path = "/mnt/big_storage/test_u.npy"
s_path = "/mnt/big_storage/test_s.npy"
v_path = "/mnt/big_storage/test_v.npy"
u_map = np.memmap(u_path, mode="w+", dtype=np.float64, shape=da_memmap.shape)
s_map = np.memmap(s_path, mode="w+", dtype=np.float64, shape=da_memmap.shape[0])
v_map = np.memmap(v_path, mode="w+", dtype=np.float64, shape=da_memmap.shape)
da.store([u, s, v], [u_map, s_map, v_map], lock=False, compute=True)
Executing the above code on the full covariance matrix raises the following error:
2023-12-29 17:34:03,672 - distributed.protocol.pickle - ERROR - Failed to serialize <ToPickle: HighLevelGraph with 1 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x786097501630>
0. 132356258993664
>.
Traceback (most recent call last):
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 63, in dumps
result = pickle.dumps(x, **dump_kwargs)
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 68, in dumps
pickler.dump(x)
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 81, in dumps
result = cloudpickle.dumps(x, **dump_kwargs)
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1479, in dumps
cp.dump(obj)
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1245, in dump
return super().dump(obj)
MemoryError
Running this code with a smaller covariance matrix, i.e. one that could theoretically fit completely into memory, does work. I've played around with the chunk size, but that doesn't help. I'm now questioning whether dask is even capable of doing what I want it to do, so any clarification in that direction would also be useful.
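For reference, the smaller test looked roughly like this (the sizes and the final compute() are illustrative; my actual test used the memmap/store pattern from above):

import dask
import dask.array as da
import numpy as np

# Illustrative smaller version of the same pipeline; the sizes are made up.
n = 5000
small = da.from_array(
    np.random.default_rng(0).standard_normal((n, n)),
    chunks=(250, n),  # same row-wise chunking as for the full matrix
)

u, s, v = da.linalg.svd(small)
u_np, s_np, v_np = dask.compute(u, s, v)  # completes at this size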
Thanks!