I have a NumPy array on disk, bigger than my available RAM.
I can load it as a memory map and use it without a problem:
a = np.memmap(filename, mode='r', shape=shape, dtype=dtype)
Further on, I can load it as a Dask array in a similar manner, following the Dask documentation:
x = da.from_array(np.memmap(filename, shape=shape, dtype=dtype, mode='r'))
How can I add rows/columns to this array?
- ideally without creating a whole new copy
- even if a whole new copy has to be created, how do I process it? (It will not fit in RAM.)
Something like
a2 = np.stack((a, new_a))
will cause the whole a array to be loaded into memory and will throw an out-of-memory error.
What's the best approach to extend memmapped NumPy or Dask arrays that are bigger than the available RAM?
I have two ideas.
The first method requires creating a whole new copy, but sticks to basic NumPy usage: allocate a new memmap with the combined shape and copy both arrays into it in chunks.
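A minimal sketch of that idea, assuming the new rows go along axis 0 and live in a second raw file; new_filename, new_shape and out_filename are names I'm introducing here, not part of the question:

import numpy as np

a = np.memmap(filename, mode='r', shape=shape, dtype=dtype)
new_a = np.memmap(new_filename, mode='r', shape=new_shape, dtype=dtype)

# Allocate the combined file on disk; mode='w+' creates it without loading anything into RAM.
out_shape = (shape[0] + new_shape[0],) + shape[1:]
out = np.memmap(out_filename, mode='w+', shape=out_shape, dtype=dtype)

# Copy in row chunks, so only one chunk is in memory at a time.
chunk = 4096
for start in range(0, shape[0], chunk):
    out[start:start + chunk] = a[start:start + chunk]
for start in range(0, new_shape[0], chunk):
    out[shape[0] + start:shape[0] + start + chunk] = new_a[start:start + chunk]
out.flush()

Slicing past the end of an array is safe in NumPy, so the last, possibly shorter chunk needs no special handling.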
The second method works directly on the binary files: you append the new data to the end of the first file, so at least the first file never needs to be copied.
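A sketch of the append, again assuming header-less, C-order files and growth along axis 0 (only then is appending rows the same as appending bytes):

import shutil
import numpy as np

# Stream the raw bytes of the second file onto the end of the first.
with open(filename, 'ab') as dst, open(new_filename, 'rb') as src:
    shutil.copyfileobj(src, dst)

# Re-open the grown file under the combined shape.
combined_shape = (shape[0] + new_shape[0],) + shape[1:]
a2 = np.memmap(filename, mode='r', shape=combined_shape, dtype=dtype)

Adding columns this way would not work: in a C-order file the bytes of a new column would have to be interleaved throughout the whole file, so that case always needs a full rewrite.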
Note that I'm assuming that your files do not contain headers, since you're reading them with np.memmap. If they do contain headers, the process gets a bit more complicated.
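For the common case of .npy headers, the copy-based method still works if you open the files with np.lib.format.open_memmap, which reads and writes the header for you (out_filename and out_shape as in the sketch above):

import numpy as np

a = np.lib.format.open_memmap(filename, mode='r')  # shape and dtype come from the header
out = np.lib.format.open_memmap(out_filename, mode='w+', shape=out_shape, dtype=a.dtype)

The byte-append method, on the other hand, would also require patching the shape recorded in the header, which is why I say it gets more complicated.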