Runtime spike when memory usage exceeds specific value


I'm learning CUDA and currently I am implementing the multiplication of two matrices. I'm testing with matrix dimensions that are powers of two, e.g. 1024, 2048, 4096 and so on.

Up to 8192 x 8192, the runtime increases by roughly a factor of 5-7 with each doubling of the dimension, but then something strange happens: the 16384 x 16384 multiplication takes about 20 times longer than the 8192 x 8192 one. From that point on, the higher dimensions again get slower by the familiar factor of 5-7 per doubling, matching what I observed from 1024^2 up to 8192^2.
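For reference, the rough growth I'd expect per doubling from the naive algorithm (each output element needs n multiply-adds, so the total work is cubic and should grow by about 2^3 = 8, roughly in line with the 5-7 I measured):

# Expected growth in arithmetic work per doubling of n for the naive algorithm
for n in [2048, 4096, 8192, 16384]:
    work_ratio = (2 * n**3) / (2 * (n // 2) ** 3)  # multiply-adds vs. previous size
    print(f"{n // 2} -> {n}: factor {work_ratio:.0f}")  # always 8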

I'm struggling to find a suitable explanation for this spike in runtime. My first guess was that at some point the three matrices (two factors and one result matrix) no longer fit into the L3 cache (128 MB on my machine); but then the spike should appear earlier, between 2048 and 4096 (if I do the math right). VRAM shouldn't be an issue either (24 GB).
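Here is the back-of-the-envelope math behind that guess (assuming 8-byte float64 elements, the NumPy default; this may be off if fill_matrix uses a different dtype):

# Combined footprint of the three N x N matrices, assuming 8-byte float64 elements
for n in [2048, 4096, 8192, 16384]:
    total_mib = 3 * n * n * 8 / 2**20
    print(f"N = {n:5d}: {total_mib:6.0f} MiB")
# N =  2048:     96 MiB  (still below 128 MB)
# N =  4096:    384 MiB  (already well above it)
# N =  8192:   1536 MiB
# N = 16384:   6144 MiB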

My code:

from numba import cuda
import numpy as np


@cuda.jit
def parallel_matrix_multiplication(matrix1, matrix2, resMatrix, n):
    # Flat global thread index; one thread computes one output element
    thread = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x

    if thread < n * n:
        col = thread // n
        row = thread % n
        row_sum = 0

        # Dot product of row 'col' of matrix1 with column 'row' of matrix2
        for i in range(n):
            row_sum += matrix1[col][i] * matrix2[i][row]

        resMatrix[col][row] = row_sum

if __name__ == '__main__':
    N = 16384
    threads_per_block = 32
    # One thread per output element, so N * N threads in total
    blocks_per_grid = N * N // threads_per_block

    matrix1, matrix2 = fill_matrix()
    result_matrix = np.zeros(shape=(N, N))

    print(f"Measuring time for N = {N} and threads_per_block = {threads_per_block}")
    # Time the kernel launch with CUDA events
    start_event = cuda.event()
    end_event = cuda.event()
    start_event.record()

    parallel_matrix_multiplication[blocks_per_grid, threads_per_block](matrix1, matrix2, result_matrix, N)

    end_event.record()
    end_event.synchronize()

    execution_time = cuda.event_elapsed_time(start_event, end_event)

    print(f"Execution time: {execution_time} ms")

Running on an RTX 4090.
