Time taken to copy matrix to host increases with the number of times the matrix was used


I am benchmarking GPU matrix multiplication using PyCUDA, CUDAMat, and Numba and ran into some behavior I can't find a way to explain.
I calculate the time it takes for 3 different steps independently - sending the 2 matrices to device memory, calculating the dot product, and copying the results back to host memory.
The benchmarking for the dot product step is done in a loop since my applications will be doing many multiplications before sending the result back.

As I increase the number of loops, the dot product time increases linearly just as expected. But the part I can't understand is that the time it takes to send the final result back to host memory also increases linearly with the loop count, even though it is only copying one matrix back to host memory. The size of the result is constant no matter how many matrix multiplication loops you do, so this makes no sense. It behaves as if returning the final result requires returning all the intermediate results at each step in the loop.

A couple of other interesting things to note. First, the increase in copy-back time has a ceiling: once I go above roughly 1000 dot products in a loop, the time it takes to copy the final result back stops growing. Second, if I reinitialize the matrix that holds the result inside the dot-product loop, the behavior goes away and the copy-back time is the same no matter how many multiplies are done.
For example -

for i in range(1000):
    # reallocate the result array on every iteration
    gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
result = gc.get()

The final thing to note is that this happens for both PyCUDA and Numba, but does not happen with CUDAMat. I can do a million multiplies and retrieving the final result still takes the same amount of time. CUDAMat has a built-in matrix multiplication, which could be why, but for PyCUDA and Numba I am using the matrix multiplication code provided in their own documentation.

Here is my code for PyCUDA

from __future__ import division
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import time
import pycuda.autoinit

kernel_code_template = """
__global__ void MatrixMulKernel(float *A, float *B, float *C)
{

  const int wA = %(MATRIX_SIZE)s;
  const int wB = %(MATRIX_SIZE)s;  

  // Block index
  const int bx = blockIdx.x;
  const int by = blockIdx.y;

  // Thread index
  const int tx = threadIdx.x;
  const int ty = threadIdx.y;

  // Index of the first sub-matrix of A processed by the block
  const int aBegin = wA * %(BLOCK_SIZE)s * by;
  // Index of the last sub-matrix of A processed by the block
  const int aEnd = aBegin + wA - 1;
  // Step size used to iterate through the sub-matrices of A
  const int aStep = %(BLOCK_SIZE)s;

  // Index of the first sub-matrix of B processed by the block
  const int bBegin = %(BLOCK_SIZE)s * bx;
  // Step size used to iterate through the sub-matrices of B
  const int bStep = %(BLOCK_SIZE)s * wB;

  // The element of the block sub-matrix that is computed
  // by the thread
  float Csub = 0;
  // Loop over all the sub-matrices of A and B required to
  // compute the block sub-matrix
  for (int a = aBegin, b = bBegin;
       a <= aEnd;
       a += aStep, b += bStep) 
    {
      // Shared memory for the sub-matrix of A
      __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
      // Shared memory for the sub-matrix of B
      __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];

      // Load the matrices from global memory to shared memory
      // each thread loads one element of each matrix
      As[ty][tx] = A[a + wA * ty + tx];
      Bs[ty][tx] = B[b + wB * ty + tx];
      // Synchronize to make sure the matrices are loaded
      __syncthreads();

      // Multiply the two matrices together;
      // each thread computes one element
      // of the block sub-matrix
      for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
        Csub += As[ty][k] * Bs[k][tx];

      // Synchronize to make sure that the preceding
      // computation is done before loading two new
      // sub-matrices of A and B in the next iteration
      __syncthreads();
    }

  // Write the block sub-matrix to global memory;
  // each thread writes one element
  const int c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
  C[c + wB * ty + tx] = Csub;
}
"""


MATRIX_SIZE = 512
TILE_SIZE = 8
BLOCK_SIZE = TILE_SIZE
np.random.seed(100)
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)

kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE,
    'BLOCK_SIZE': BLOCK_SIZE,
}
mod = compiler.SourceModule(kernel_code)
matrixmul = mod.get_function("MatrixMulKernel")


#copy to device memory
total = time.perf_counter()
ga = gpuarray.to_gpu(a_cpu)
gb = gpuarray.to_gpu(b_cpu)
gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
copy_to = time.perf_counter() - total

#matrix multiplication
mult = time.perf_counter()
for i in range(1000):
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
mult = time.perf_counter() - mult

#copy result back to host memory
copy_from = time.perf_counter()
res = gc.get()
copy_from = time.perf_counter() - copy_from
total = time.perf_counter() - total

#print out times for all 3 steps and the total time taken
print(copy_to)
print(mult)
print(copy_from)
print(total)

There is 1 answer below

BEST ANSWER

GPU kernel launches are asynchronous. This means that the measurement you think you are capturing around the for-loop (the time it takes to do the multiplication) is not really that. It is just the time it takes to issue the kernel launches into a queue.

The actual kernel execution time is getting "absorbed" into your final measurement of device->host copy time (because the D->H copy forces all kernels to complete before it will begin, and it blocks the CPU thread).
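
One way to see this directly (not part of the original answer; a sketch using PyCUDA's driver.Event API, assuming the variables from the question's script are in scope) is to time the kernels with CUDA events, which measure elapsed time on the device itself rather than how long the host-side calls took:

# Hypothetical check with CUDA events, assuming ga, gb, gc, matrixmul,
# MATRIX_SIZE and TILE_SIZE from the question's script are in scope
start_evt = driver.Event()
end_evt = driver.Event()

start_evt.record()      # marker enqueued before the kernels
for i in range(1000):
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
end_evt.record()        # marker enqueued after the kernels
end_evt.synchronize()   # block until the GPU has reached the end marker

kernel_ms = start_evt.time_till(end_evt)   # GPU time for all 1000 kernels, in ms
print("kernel execution time: %.1f ms" % kernel_ms)

This GPU-side number should roughly match the "extra" time that was showing up in the gc.get() measurement.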

Regarding the "peak" behavior: the launch queue has a finite depth, so once you have issued enough kernels, further launches stop being asynchronous and begin to block the CPU thread. At that point your "execution time" measurement starts rising and the copy-back measurement stops growing, which is why the copy time levels off at around ~1000 launches.

To "fix" this, if you insert a pycuda driver.Context.synchronize() immediately after your for-loop, and before this line:

mult = time.perf_counter() - mult

you will see your execution time increase as you increase the loop count, and your D->H copy time will remain constant.
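
Put together, a minimal sketch of the corrected timing for the script above (the only change from the question's code is the added synchronize; time.perf_counter() is used as in the script):

# matrix multiplication - synchronize before stopping the clock so the
# measurement includes actual kernel execution, not just launch overhead
mult = time.perf_counter()
for i in range(1000):
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
driver.Context.synchronize()   # wait for every queued kernel to finish
mult = time.perf_counter() - mult

# copy result back to host memory - now measures only the D->H transfer
copy_from = time.perf_counter()
res = gc.get()
copy_from = time.perf_counter() - copy_from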