I have a very large covariance matrix (480,000 x 480,000) stored on disk in a binary format. I want to compute a corresponding whitening matrix, for which I need the SVD of the covariance matrix. I'm working on a very powerful system (40 cores, 360 GB of RAM), but that still isn't enough to do the computation in memory: in float64 the matrix alone is 480,000 x 480,000 x 8 bytes ≈ 1.8 TB, and the U and V factors would each be just as large. I'm trying to use a combination of numpy.memmap and dask to tackle this problem, since from what I understand dask is able to read chunks of data, process them, and even save results chunk-wise. However, I run into a MemoryError whenever I try to work with the full covariance matrix, and I don't understand what I'm doing wrong.
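For context, the whitening step I want to run afterwards looks roughly like this (a toy in-memory sketch with a small made-up covariance matrix; the sizes and variable names are just illustrative):

import numpy as np

# Small stand-in covariance matrix instead of the real 480,000 x 480,000 one.
rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, 50))
cov = np.cov(samples, rowvar=False)

# For a symmetric PSD covariance matrix C = U diag(s) U^T, a (ZCA) whitening
# matrix is W = U diag(1 / sqrt(s)) U^T.
u, s, _ = np.linalg.svd(cov)
whitening = u @ np.diag(1.0 / np.sqrt(s)) @ u.T

# Sanity check: W C W^T should be (numerically) the identity.
assert np.allclose(whitening @ cov @ whitening.T, np.eye(cov.shape[0]))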
This is a simplified version of the code I want to use:
import numpy as np
import dask
import dask.array as da
from dask.distributed import LocalCluster, Client
client = Client(processes=False, n_workers=10, threads_per_worker=1, memory_limit="30GiB")
da_memmap = da.from_array(
    np.memmap(
        "path_to_covariance_matrix.npy",
        mode="r",
        shape=(480000, 480000),
        dtype=np.float64,
    ),
    chunks=(250, 480000),
)
At this point da_memmap is a dask array of shape (480000, 480000) and dtype float64, split into 1920 chunks of shape (250, 480000).
u, s, v = da.linalg.svd(da_memmap)
u_path = "/mnt/big_storage/test_u.npy"
s_path = "/mnt/big_storage/test_s.npy"
v_path = "/mnt/big_storage/test_v.npy"
u_map = np.memmap(u_path, mode="w+", dtype=np.float64, shape=da_memmap.shape)
s_map = np.memmap(s_path, mode="w+", dtype=np.float64, shape=da_memmap.shape[0])
v_map = np.memmap(v_path, mode="w+", dtype=np.float64, shape=da_memmap.shape)
da.store([u, s, v], [u_map, s_map, v_map], lock=False, compute=True)
Executing the above code on the full covariance matrix raises the following error:
2023-12-29 17:34:03,672 - distributed.protocol.pickle - ERROR - Failed to serialize <ToPickle: HighLevelGraph with 1 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x786097501630>
0. 132356258993664
>.
Traceback (most recent call last):
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 63, in dumps
result = pickle.dumps(x, **dump_kwargs)
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 68, in dumps
pickler.dump(x)
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 81, in dumps
result = cloudpickle.dumps(x, **dump_kwargs)
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1479, in dumps
cp.dump(obj)
File "/home/shash/miniconda3/envs/analysis/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1245, in dump
return super().dump(obj)
MemoryError
Running this code with a smaller covariance matrix, i.e. one that could theoretically fit completely into memory, does work. I've played around with the chunk size, but that doesn't help. I'm now questioning whether dask is even capable of doing what I want it to do, so any clarification in that direction would also be useful.
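For reference, the smaller test looked roughly like this (the sizes and the final compute() are illustrative; my actual test used the memmap/store pattern from above):

import dask
import dask.array as da
import numpy as np

# Illustrative smaller version of the same pipeline; the sizes are made up.
n = 5000
small = da.from_array(
    np.random.default_rng(0).standard_normal((n, n)),
    chunks=(250, n),  # same row-wise chunking as for the full matrix
)

u, s, v = da.linalg.svd(small)
u_np, s_np, v_np = dask.compute(u, s, v)  # completes at this size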
Thanks!