More efficient way to access rows based on a list of indices in 2d numpy array?


So I have a 2d numpy array arr. It's a relatively big one: arr.shape = (2400, 60000)

What I'm currently doing is the following:

  • randomly (with replacement) select arr.shape[0] row indices
  • access (row-wise) the chosen indices of arr
  • calculate the column-wise averages and select the max value
  • repeat this no_samples times

It looks something like:

import numpy as np

no_rows = arr.shape[0]
indices = np.arange(no_rows)
my_vals = []
for k in range(no_samples):
    # draw a bootstrap sample of row indices (with replacement)
    random_idxs = np.random.choice(indices, size=no_rows, replace=True)
    # take those rows, average column-wise, keep the maximum
    my_vals.append(
        arr[random_idxs].mean(axis=0).max()
    )

My problem is that this is very slow. With my arr size, one loop iteration takes ~3s. Since I want more than 1k samples, my current solution is pretty bad (1k*~3s -> ~1h). I've profiled it and the bottleneck is accessing the rows by indices; "mean" and "max" work fast, and np.random.choice is also fine.
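To illustrate, a rough timing sketch of the two steps (the array here is a random stand-in with the same shape as mine; absolute numbers will of course vary by machine):

import time
import numpy as np

arr = np.random.rand(2400, 60000)   # ~1.1 GB stand-in, same shape as my array
random_idxs = np.random.choice(arr.shape[0], size=arr.shape[0], replace=True)

t0 = time.perf_counter()
sample = arr[random_idxs]           # fancy indexing copies all selected rows
t1 = time.perf_counter()
val = sample.mean(axis=0).max()     # the reductions themselves are fast
t2 = time.perf_counter()
print(f"indexing: {t1 - t0:.2f}s  mean+max: {t2 - t1:.2f}s")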

Do you see any area for improvement? A more efficient way of accessing the indices, or maybe a faster approach that solves the problem without it?

What I tried so far:

  • numpy.take (slower)
  • numpy.ravel:

something similar to:

random_idxs = np.random.choice(sample_idxs, size=sample_size, replace=True) 
test = random_idxs.ravel()[arr.ravel()].reshape(arr.shape)
  • a similar approach to the current one, but without the loop: I created a 3d array of indices and accessed the rows across the additional dimension in one go (see the sketch below)
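A minimal sketch of that last, loop-free attempt (names are illustrative; the intermediate array is no_samples times the size of arr, which is why it doesn't really help at this scale):

# all bootstrap index sets at once: shape (no_samples, no_rows)
all_idxs = np.random.choice(arr.shape[0], size=(no_samples, arr.shape[0]), replace=True)
# fancy indexing builds a (no_samples, no_rows, 60000) copy in one go
samples = arr[all_idxs]
my_vals = samples.mean(axis=1).max(axis=1)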

There are 2 answers below.

BEST ANSWER

Since advanced indexing generates a copy, the program allocates a huge amount of memory for arr[random_idxs].

So one of the simplest ways to improve efficiency is to do things batch-wise:

BATCH = 512
# scan the 60000 columns in chunks; only a (no_rows, BATCH) copy exists at a time
max(arr[random_idxs, i:i + BATCH].mean(axis=0).max() for i in range(0, arr.shape[1], BATCH))
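A rough sketch of how this plugs into the original sampling loop (the helper name batched_mean_max and BATCH = 512 are just illustrative; tune BATCH to your memory budget):

import numpy as np

BATCH = 512

def batched_mean_max(arr, row_idxs, batch=BATCH):
    # column-wise means of the sampled rows; the max is taken over column chunks,
    # so each step materialises only a (len(row_idxs), batch) copy
    return max(
        arr[row_idxs, i:i + batch].mean(axis=0).max()
        for i in range(0, arr.shape[1], batch)
    )

no_rows = arr.shape[0]
my_vals = [
    batched_mean_max(arr, np.random.choice(no_rows, size=no_rows, replace=True))
    for _ in range(no_samples)
]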
SECOND ANSWER

This is not a general solution to the problem, but it should make your specific problem much faster. Basically, the column means arr.mean(axis=0) won't change, so why not take random samples from that array?

Something like:

mean_max = arr.mean(axis=0)  # column-wise means of the full array
my_vals = np.array([
    np.random.choice(mean_max, size=len(mean_max), replace=True)
    for i in range(no_samples)
])

You may even be able to do: my_vals = np.random.choice(mean_max, size=(no_samples, len(mean_max)), replace=True), but I'm not sure how, if at all, that would change your statistics.
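For completeness, a sketch of that one-call variant (each row of my_vals is then one bootstrap draw from the column means; the same caveat about the statistics applies):

mean_max = arr.mean(axis=0)   # column-wise means, fixed
# one call draws all samples at once: shape (no_samples, len(mean_max))
my_vals = np.random.choice(mean_max, size=(no_samples, len(mean_max)), replace=True)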