So I have a 2D numpy array arr. It's a relatively big one: arr.shape = (2400, 60000)
What I'm currently doing is the following:
- randomly (with replacement) select arr.shape[0] indices
- access (row-wise) the chosen indices of arr
- calculate column-wise averages and select the max value
- repeat this k times
It looks something like:
import numpy as np

no_rows = arr.shape[0]
indices = np.arange(no_rows)
my_vals = []
for k in range(no_samples):
    # draw no_rows row indices with replacement (one bootstrap sample)
    random_idxs = np.random.choice(indices, size=no_rows, replace=True)
    my_vals.append(
        arr[random_idxs].mean(axis=0).max()
    )
My problem is that it is very slow. With my arr size, it takes ~3s for one loop iteration. Since I want a sample bigger than 1k, my current solution is pretty bad (1k * ~3s -> ~1h). I've profiled it and the bottleneck is accessing the rows by index; mean and max work fast, and np.random.choice is also fine.
Do you see any room for improvement? Either a more efficient way of accessing the rows by index, or better yet, a faster approach that solves the problem without it?
What I tried so far:
- numpy.take (slower)
- numpy.ravel: something similar to:
random_idxs = np.random.choice(sample_idxs, size=sample_size, replace=True)
test = random_idxs.ravel()[arr.ravel()].reshape(arr.shape)
- a similar approach to the current one but without the loop: I created a 3D array and accessed the rows across the additional dimension in one go (a rough sketch is below)
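Roughly, that 3D attempt looked like the sketch below (not my exact code; the all_idxs name is just for illustration). The intermediate arr[all_idxs] has shape (no_samples, no_rows, no_cols), so it runs out of memory for larger sample counts:

import numpy as np

# draw all bootstrap row indices at once: shape (no_samples, no_rows)
all_idxs = np.random.choice(no_rows, size=(no_samples, no_rows), replace=True)
# advanced indexing builds a (no_samples, no_rows, no_cols) copy -- huge for my arr
my_vals = arr[all_idxs].mean(axis=1).max(axis=1)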
Since advanced indexing generates a copy, the program allocates a huge amount of memory for arr[random_idxs]. So one of the simplest ways to improve efficiency is to do things batch-wise.
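For example, here is a minimal sketch of that batch-wise idea (the bootstrap_max function and its batch_size parameter are illustrative names, not from the question): accumulate the column sums over small chunks of the sampled row indices, so each advanced-indexing copy stays small, and only divide and take the max at the end.

import numpy as np

def bootstrap_max(arr, no_samples, batch_size=256, rng=None):
    # For each bootstrap sample: max over column-wise means, computed batch-wise
    # so the copy made by advanced indexing never exceeds batch_size rows.
    rng = np.random.default_rng() if rng is None else rng
    no_rows, no_cols = arr.shape
    vals = np.empty(no_samples)
    for k in range(no_samples):
        # sample no_rows row indices with replacement
        random_idxs = rng.integers(0, no_rows, size=no_rows)
        col_sums = np.zeros(no_cols)
        for start in range(0, no_rows, batch_size):
            col_sums += arr[random_idxs[start:start + batch_size]].sum(axis=0)
        vals[k] = (col_sums / no_rows).max()
    return vals

Called as my_vals = bootstrap_max(arr, no_samples=1000), this should be statistically equivalent to the original loop while keeping each temporary copy at roughly batch_size * no_cols elements instead of a full-array-sized copy.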