I am trying to improve the performance of a C++ program that performs a large number of sparse matrix by dense matrix multiplications. The program originally uses the Intel MKL Sparse BLAS function mkl_scsrmm(), with the sparse matrix stored in CSR format. The results are decent for small matrices, but not great for large numbers of multiplications of big matrices (above 1024 x 1024).
I tried an OpenCL implementation to run the multiplication on the GPU (AMD Radeon R9 200), using the kernel from ViennaCL 1.7. However, the MKL function beats the OpenCL version in pretty much every trial with matrices smaller than 1024 x 1024, and the two are roughly on par after that.
The OpenCL implementation I used is the following:
__kernel void mat_mult_row_row(__global const unsigned int * sp_mat_row_indices,
                               __global const unsigned int * sp_mat_col_indices,
                               __global const float * sp_mat_elements,
                               __global const float * d_mat,
                               unsigned int d_mat_row_start,
                               unsigned int d_mat_col_start,
                               unsigned int d_mat_row_inc,
                               unsigned int d_mat_col_inc,
                               unsigned int d_mat_row_size,
                               unsigned int d_mat_col_size,
                               unsigned int d_mat_internal_rows,
                               unsigned int d_mat_internal_cols,
                               __global float * result,
                               unsigned int result_row_start,
                               unsigned int result_col_start,
                               unsigned int result_row_inc,
                               unsigned int result_col_inc,
                               unsigned int result_row_size,
                               unsigned int result_col_size,
                               unsigned int result_internal_rows,
                               unsigned int result_internal_cols)
{
    // One work-group per sparse-matrix row; groups stride over the rows.
    for (unsigned int row = get_group_id(0); row < result_row_size; row += get_num_groups(0))
    {
        unsigned int row_start = sp_mat_row_indices[row];
        unsigned int row_end   = sp_mat_row_indices[row + 1];

        // Each work item in the group handles a strided subset of the result columns.
        for (unsigned int col = get_local_id(0); col < result_col_size; col += get_local_size(0))
        {
            float r = 0;
            // Walk the nonzeros of this sparse row.
            for (unsigned int k = row_start; k < row_end; k++)
            {
                unsigned int j = sp_mat_col_indices[k];
                float x = sp_mat_elements[k];
                float y = d_mat[(d_mat_row_start + j * d_mat_row_inc) * d_mat_internal_cols
                                + d_mat_col_start + col * d_mat_col_inc];
                r += x * y;
            }
            result[(result_row_start + row * result_row_inc) * result_internal_cols
                   + result_col_start + col * result_col_inc] = r;
        }
    }
}
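For reference (and to sanity-check the GPU results), this is the computation the kernel performs, written as a minimal CPU sketch. It assumes unpadded, contiguous row-major storage, i.e. all the `*_start` parameters are 0, the `*_inc` parameters are 1, and the internal sizes equal the logical sizes:

```cpp
#include <cassert>
#include <vector>

// Reference CSR * dense (row-major) multiply, mirroring the kernel's loop
// structure with no padding or strides.
std::vector<float> csr_spmm(const std::vector<unsigned>& row_ptr,  // size n_rows + 1
                            const std::vector<unsigned>& col_idx,  // column index per nonzero
                            const std::vector<float>& values,      // value per nonzero
                            const std::vector<float>& d_mat,       // dense matrix, row-major
                            unsigned n_rows, unsigned n_cols)
{
    std::vector<float> result(static_cast<std::size_t>(n_rows) * n_cols, 0.0f);
    for (unsigned row = 0; row < n_rows; ++row)
        for (unsigned col = 0; col < n_cols; ++col)
        {
            float r = 0.0f;
            // Walk the nonzeros of this sparse row.
            for (unsigned k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
                r += values[k] * d_mat[col_idx[k] * n_cols + col];
            result[row * n_cols + col] = r;
        }
    return result;
}
```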
I enqueue this kernel with as many work items as there are elements in the result matrix (or rather the smallest multiple of 64 at least that large, since, as I understand it, 64 is the wavefront size on AMD devices). The kernel supports padded matrices, but I didn't use any padding.
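The round-up I use for the global work size is the usual one; a small sketch (the `round_up` name is just mine, and 64 is the AMD wavefront size mentioned above):

```cpp
#include <cstddef>

// Round a global work size up to the next multiple of the work-group /
// wavefront size. Works for any positive multiple, not just powers of two.
std::size_t round_up(std::size_t value, std::size_t multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}
```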
I created 200 random sparse and dense matrices and multiplied them with both methods, timing each multiplication. For 1024 x 1024 matrices, both the MKL and the OpenCL implementations averaged about 15 ms per multiplication. I understand that memory transfers can be costly with OpenCL, but with large matrices the computation itself should come out ahead.
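In case the methodology matters: I time wall-clock around the whole call, so for OpenCL this includes the host/device transfers. A minimal sketch of the timing helper, where `multiply_once` is a stand-in for either the MKL call or the enqueue-plus-finish of the OpenCL version:

```cpp
#include <cassert>
#include <chrono>

// Time a single invocation of multiply_once() in milliseconds using a
// monotonic wall clock; whatever the callable does (including any
// device transfers) is counted.
template <typename F>
double time_ms(F&& multiply_once)
{
    auto t0 = std::chrono::steady_clock::now();
    multiply_once();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```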
I'm not very experienced with OpenCL; this is only my second serious attempt at it. Any hints on what the performance problem might be, or is this as good as I can get at the moment? How can I speed up sparse matrix by dense matrix multiplication beyond the Intel MKL Sparse BLAS function, if that's possible? All help is appreciated.