What's the best approach to extend memmap'ed Numpy or Dask arrays (bigger than available ram)?

I have a NumPy array on disk that is bigger than my available RAM. I can load it as a memory map and use it without problems:

a = np.memmap(filename, mode='r', shape=shape, dtype=dtype)

Furthermore, I can load it as a Dask array in a similar manner, following the Dask documentation:

import dask.array as da

dask_a = da.from_array(np.memmap(filename, shape=shape, dtype=dtype, mode='r'))

How can I add rows/columns to this array,

  • ideally without creating a whole new copy?
  • even if a whole new copy has to be created, how can I process it? (It will not fit in RAM.)

Something like

a2 = np.stack((a, new_a))

will load the whole a array into memory and raise an out-of-memory error.

What's the best approach to extend memmap'ed Numpy or Dask arrays (bigger than available ram)?

Answer by ken:

I have two ideas.

The first method requires creating a whole new copy, but it uses only basic NumPy.

import numpy as np

# The output shape: the arrays are concatenated along the first axis,
# so all other dimensions must match.
out_shape = (shape1[0] + shape2[0],) + shape1[1:]

# First, create a file of the required size. This must be a new file.
out = np.memmap(out_file, mode="w+", dtype=dtype, shape=out_shape)

# Copy the contents of both input files through memory maps.
# The OS pages the data in and out, so the arrays never have to fit in RAM.
array1 = np.memmap(in_file1, mode="r", dtype=dtype, shape=shape1)
array2 = np.memmap(in_file2, mode="r", dtype=dtype, shape=shape2)
out[:shape1[0]] = array1
out[shape1[0]:] = array2

# Don't forget to flush the result to disk.
out.flush()
del out
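
If you then need to process the combined array out of core, one option (a minimal sketch, not part of the original answer, assuming Dask is installed and that out_file, out_shape and dtype are the names used above) is to wrap the new memmap in a Dask array with explicit chunks, as in the question, and let Dask work through it chunk by chunk:

import dask.array as da
import numpy as np

# Re-open the combined file read-only and wrap it in a Dask array.
# The chunk size along the first axis (10_000) is an arbitrary example value.
combined = np.memmap(out_file, mode="r", dtype=dtype, shape=out_shape)
dask_combined = da.from_array(combined, chunks=(10_000,) + out_shape[1:])

# Reductions and element-wise operations are evaluated chunk by chunk,
# so the full array never has to be resident in RAM at once.
print(dask_combined.mean().compute())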

The second method is to simply copy the files as raw binary. With this approach, the contents of the second file are appended to the first, so at least the first file does not need to be copied.

chunk_size = 8192

# The contents of file2 are appended to file1.
with open(file2, "rb") as f2, open(file1, "ab") as f1:
    while True:
        chunk = f2.read(chunk_size)
        if not chunk:
            break
        f1.write(chunk)
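
After the append, the combined data can be re-opened as a single memmap with the enlarged first dimension. A minimal sketch, assuming both arrays share every dimension except the first:

import numpy as np

# file1 now holds the bytes of both arrays back to back.
combined_shape = (shape1[0] + shape2[0],) + shape1[1:]
combined = np.memmap(file1, mode="r", dtype=dtype, shape=combined_shape)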

Note that I'm assuming that your files do not contain headers, since you're reading them with np.memmap. If they do contain headers, the process gets a bit more complicated.
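
If your files are actually .npy files (i.e. they do have headers), one option, not covered above, is np.lib.format.open_memmap, which parses the header and returns a memmap with the correct shape and dtype:

import numpy as np

# open_memmap reads the .npy header and memory-maps the data that follows it.
array1 = np.lib.format.open_memmap(in_file1, mode="r")
print(array1.shape, array1.dtype)

Note, however, that appending raw bytes to a .npy file does not update its header, so for files with headers the first method (writing a new output file) is the safer route.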