Numpy memmap in-place sort of a large matrix by column

364 Views Asked by At

I'd like to sort a matrix of shape (N, 2) on the first column where N >> system memory.

With in-memory numpy you can do:

x = np.array([[2, 10],[1, 20]])
sortix = x[:,0].argsort()
x = x[sortix]

But that appears to require that x[:,0].argsort() fit in memory, which won't work for memmap where N >> system memory (please correct me if this assumption is wrong).

Can I achieve this sort in-place with numpy memmap?

(assume heapsort is used for sorting and simple numeric data types are used)

2

There are 2 best solutions below

0
On BEST ANSWER

The solution may be simple, using the order argument to an in place sort. Of course, order requires fieldnames, so those have to be added first.

d = x.dtype
x = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])
array([[(2, 10)],
   [(1, 20)]], dtype=[('0', '<i8'), ('1', '<i8')])

The field names are strings, corresponding to the column indices. Sorting can be done in place with

x.sort(order='0', axis=0)

Then convert back to a regular array with the original datatype

x.view(d)
array([[ 1, 20],
   [ 2, 10]])

That should work, although you may need to change how the view is taken depending on how the data is stored on disk, see the docs

For a.view(some_dtype), if some_dtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the behavior of the view cannot be predicted just from the superficial appearance of a (shown by print(a)). It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.

0
On

@user2699 answered the question beautifully. I'm adding this solution as a simplified example in case you don't mind keeping your data as a structured array, which does away with the view.

import numpy as np

filename = '/tmp/test'
x = np.memmap(filename, dtype=[('index', '<f2'),('other1', '<f2'),('other2', '<f2')], mode='w+', shape=(2,))
x[0] = (2, 10, 30)
x[1] = (1, 20, 20)
print(x.shape)
print(x)
x.sort(order='index', axis=0, kind='heapsort')
print(x)

(2,)
[(2., 10., 30.) (1., 20., 20.)]
[(1., 20., 20.) (2., 10., 30.)]

Also the dtype formats are documented here.