How to create mini-batches of predefined sizes from a sparse 2D matrix in Python?


I have a sparse 2D matrix saved on disk (.npz extension) that I created in a preprocessing step with scipy.sparse.csr_matrix. It is a long sequence of 1-channel images in piano-roll format (a numerical representation of MIDI). I cannot convert the whole matrix to a dense representation because it will not fit in memory.

How do I create mini-batches with predefined sizes from the sparse matrix?

I've tried converting CSR representation to COO and creating batches of data from it.

import scipy.sparse

sparse_matrix = scipy.sparse.load_npz(file_name)
coo_matrix = sparse_matrix.tocoo()
for batch_index in range(num_batches):
    start_index = batch_index * num_samples
    end_index = (batch_index + 1) * num_samples

    batch_data = coo_matrix.data[start_index:end_index]
    batch_row = coo_matrix.row[start_index:end_index]
    batch_col = coo_matrix.col[start_index:end_index]

    batch_sparse_matrix = scipy.sparse.coo_matrix(
        (batch_data, (batch_row, batch_col)),
        shape=(batch_size, image_width*image_height)
    )

But I get errors like "row index exceeds matrix dimensions", which means I have too much data for the shape I defined: the row and column indices fall outside the shape boundaries.
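
One likely cause of that error is that the slices above cut the list of nonzero entries rather than the rows, and the row indices they keep are still absolute positions in the full matrix. A minimal sketch of an alternative, assuming each row of the matrix is one flattened image, is to slice the CSR matrix by rows and densify only the current batch. The helper name `iter_row_batches` and the small `demo` matrix are hypothetical stand-ins for the real data loaded with `scipy.sparse.load_npz`.

```python
import numpy as np
import scipy.sparse


def iter_row_batches(csr, batch_size):
    """Yield dense arrays of up to `batch_size` consecutive rows of a CSR matrix."""
    num_rows = csr.shape[0]
    for start in range(0, num_rows, batch_size):
        stop = min(start + batch_size, num_rows)
        # Row slicing a CSR matrix returns a new CSR matrix whose rows are
        # renumbered from 0, so only this small slice is ever made dense.
        yield csr[start:stop].toarray()


# Small synthetic stand-in for the matrix stored in the .npz file.
demo_dense = (np.arange(60).reshape(10, 6) % 4 == 0).astype(np.float32)
demo = scipy.sparse.csr_matrix(demo_dense)

batches = list(iter_row_batches(demo, batch_size=4))
```

The last batch is shorter when the number of rows is not a multiple of the batch size; whether to pad or drop it depends on the training loop.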

I've tried something like this, to get the right amount of data, but it's very slow.


non_zero_indices = np.where((coo_matrix.row >= start_index) & (coo_matrix.row < end_index))[0]

start_index = non_zero_indices[0]
end_index = non_zero_indices[-1] + 1
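
The `np.where` scan touches every nonzero entry for every batch, which may explain the slowness. A possible shortcut, sketched here under the assumption that the COO matrix comes straight from `tocoo()` on the CSR matrix (so both store the nonzeros in the same row-ordered sequence), is to read the same range from the CSR `indptr` array. The function name `nonzero_range` and the sample data are hypothetical.

```python
import numpy as np
import scipy.sparse

# Small synthetic stand-in for the real matrix.
dense = (np.arange(60).reshape(10, 6) % 4 == 0).astype(np.float32)
csr = scipy.sparse.csr_matrix(dense)
coo = csr.tocoo()


def nonzero_range(csr, start_row, end_row):
    """Return (lo, hi) so that rows [start_row, end_row) own nonzeros lo:hi."""
    # indptr[i] is the position of the first stored entry of row i.
    return int(csr.indptr[start_row]), int(csr.indptr[end_row])


start_row, end_row = 4, 8
lo, hi = nonzero_range(csr, start_row, end_row)

# Shift the row indices so that the batch starts at row 0.
batch = scipy.sparse.coo_matrix(
    (coo.data[lo:hi], (coo.row[lo:hi] - start_row, coo.col[lo:hi])),
    shape=(end_row - start_row, csr.shape[1]),
)
```

Subtracting `start_row` from the row indices is what keeps them inside the batch shape.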