What's the best approach to extend memmap'ed Numpy or Dask arrays (bigger than available ram)?

I have a NumPy array on disk that is bigger than my available RAM. I can load it as a memory map and use it without problems:

a = np.memmap(filename, mode='r', shape=shape, dtype=dtype)

Furthermore, I can load it as a Dask array in a similar manner, following the Dask documentation:

import dask.array as da

dask_a = da.from_array(np.memmap(filename, shape=shape, dtype=dtype, mode='r'))

How can I add rows/columns to this array,

  • ideally without creating a whole new copy?
  • even if a whole new copy has to be created, how can I process it? (It will not fit in RAM.)

Something like

a2 = np.stack((a, new_a))

will load the whole a array into memory and raise an out-of-memory error.

What's the best approach to extend memmap'ed Numpy or Dask arrays (bigger than available ram)?

Answer by ken:

I have two ideas.

The first method requires creating a whole new copy, but it uses only basic NumPy.

import numpy as np

# The output shape: the arrays are concatenated along the first axis,
# so all other dimensions must match.
out_shape = (shape1[0] + shape2[0],) + shape1[1:]

# First, create a file of the required size. This must be a new file.
out = np.memmap(out_file, mode="w+", dtype=dtype, shape=out_shape)

# Copy the contents of both input files through memory maps.
# The OS pages the data in and out, so the arrays never have to fit in RAM.
array1 = np.memmap(in_file1, mode="r", dtype=dtype, shape=shape1)
array2 = np.memmap(in_file2, mode="r", dtype=dtype, shape=shape2)
out[:shape1[0]] = array1
out[shape1[0]:] = array2

# Don't forget to flush the result to disk.
out.flush()
del out
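
If you then need to process the combined array out of core, one option (a minimal sketch, not part of the original answer, assuming Dask is installed and that out_file, out_shape and dtype are the names used above) is to wrap the new memmap in a Dask array with explicit chunks, as in the question, and let Dask work through it chunk by chunk:

import dask.array as da
import numpy as np

# Re-open the combined file read-only and wrap it in a Dask array.
# The chunk size along the first axis (10_000) is an arbitrary example value.
combined = np.memmap(out_file, mode="r", dtype=dtype, shape=out_shape)
dask_combined = da.from_array(combined, chunks=(10_000,) + out_shape[1:])

# Reductions and element-wise operations are evaluated chunk by chunk,
# so the full array never has to be resident in RAM at once.
print(dask_combined.mean().compute())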

The second method is to simply copy the files as raw binary. With this approach, the contents of the second file are appended to the first, so at least the first file does not need to be copied.

chunk_size = 8192

# The contents of file2 are appended to file1.
with open(file2, "rb") as f2, open(file1, "ab") as f1:
    while True:
        chunk = f2.read(chunk_size)
        if not chunk:
            break
        f1.write(chunk)
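
After the append, the combined data can be re-opened as a single memmap with the enlarged first dimension. A minimal sketch, assuming both arrays share every dimension except the first:

import numpy as np

# file1 now holds the bytes of both arrays back to back.
combined_shape = (shape1[0] + shape2[0],) + shape1[1:]
combined = np.memmap(file1, mode="r", dtype=dtype, shape=combined_shape)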

Note that I'm assuming that your files do not contain headers, since you're reading them with np.memmap. If they do contain headers, the process gets a bit more complicated.
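
If your files are actually .npy files (i.e. they do have headers), one option, not covered above, is np.lib.format.open_memmap, which parses the header and returns a memmap with the correct shape and dtype:

import numpy as np

# open_memmap reads the .npy header and memory-maps the data that follows it.
array1 = np.lib.format.open_memmap(in_file1, mode="r")
print(array1.shape, array1.dtype)

Note, however, that appending raw bytes to a .npy file does not update its header, so for files with headers the first method (writing a new output file) is the safer route.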