Performance drop in matrix multiplication for certain sizes on AMD Polaris


I have OpenCL code that multiplies two matrices (GEMM) with M=4096, N=4096 and K=16, i.e. the input matrices are 4096 x 16 floats.

I run it on a Polaris 560, a 16-CU GPU.

Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl

I noticed a very strange performance drop for this size: matrix multiplication at this size runs at ~8-10 GFLOPS, while if I change N to 4095 or 4097 I get around 130-150 GFLOPS. I noticed similar behaviour with other GEMM libraries like clBLAS or MIOpenGEMM - there is a significant performance drop for this particular 4096x16 size, and changing N by 1 boosts the performance several times.

The workload is split into work-groups of 256 threads. Each work-group handles 128x16 and 128x16 matrix tiles (an 8x8 block per thread); a sketch of the launch geometry I have in mind is below.
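
For reference, a minimal sketch of that launch geometry (the helper name enqueue_gemm and the constants here are illustrative, not taken from gemm.cl): a 16x16 work-group of 256 threads, each thread computing an 8x8 block, so one group covers a 128x128 tile of C.

    #include <CL/cl.h>

    enum { BLOCK = 8, LOCAL = 16, TILE = BLOCK * LOCAL };   /* 8 * 16 = 128 */

    /* Hypothetical host-side helper: one 2D NDRange, 16x16 threads per group,
       one group per 128x128 tile of the output matrix C. */
    static cl_int enqueue_gemm(cl_command_queue queue, cl_kernel gemm_kernel,
                               size_t M, size_t N)
    {
        size_t local_size[2]  = { LOCAL, LOCAL };           /* 256 threads per group */
        size_t global_size[2] = {
            ((M + TILE - 1) / TILE) * LOCAL,                /* M = 4096 -> 32 groups */
            ((N + TILE - 1) / TILE) * LOCAL                 /* N = 4096 -> 32 groups */
        };
        return clEnqueueNDRangeKernel(queue, gemm_kernel, 2, NULL,
                                      global_size, local_size, 0, NULL, NULL);
    }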

I tried changing the matrix tiling to 96x96 with 6x6 blocks instead of 128x128 with 8x8 - same result.

I tested the same code with ROCm 3.7 OpenCL, Clover OpenCL and even with the Windows OpenCL driver - same behaviour.

There is no such issue with an NVIDIA GTX 960, which has the same number of GPU cores (threads) and the same memory type/size.

I suspect that this is somehow cache/collision related, but I don't understand how it happens, so I don't know how to work around it.


BEST ANSWER

Finally, I found that the clBLAS library (originally developed by AMD) handles the special case of lda % 1024 == 0, ldb % 1024 == 0 separately, probably because of the cache:

https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/specialCases/GemmSpecialCases.cpp#L228
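
My interpretation of why lda % 1024 == 0 is the trigger (this is not stated in the clBLAS source, so treat it as a guess): 1024 floats is exactly 4 KiB, so rows that start a multiple of 1024 floats apart keep landing on the same cache sets / memory channels. For my N = 4096 case the row stride is a multiple of 4 KiB as well:

    #include <stdio.h>

    int main(void)
    {
        const size_t lda = 4096;                         /* floats per row (N = 4096) */
        const size_t row_stride = lda * sizeof(float);   /* bytes between row starts  */
        printf("row stride = %zu bytes (%zu KiB)\n", row_stride, row_stride / 1024);
        printf("multiple of 4 KiB: %s\n", row_stride % 4096 == 0 ? "yes" : "no");
        return 0;
    }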

I found that the better approach was to rearrange the blocks in z-curve order instead of queuing several kernels:

https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl#L109
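
Roughly, the remapping looks like this (a sketch only; the names are made up and the real kernel in gemm.cl differs in details). The kernel is launched as a 1D range of work-groups and each group decodes its tile coordinates from its group id along the Morton curve:

    /* OpenCL C. Compact the even-position bits of x into a dense value
       (standard Morton decode). */
    uint compact_even_bits(uint x)
    {
        x &= 0x55555555u;
        x = (x ^ (x >> 1)) & 0x33333333u;
        x = (x ^ (x >> 2)) & 0x0F0F0F0Fu;
        x = (x ^ (x >> 4)) & 0x00FF00FFu;
        x = (x ^ (x >> 8)) & 0x0000FFFFu;
        return x;
    }

    /* Demo kernel: record in which order each (row, col) tile is visited along
       the z-curve; a real GEMM kernel would go on to compute that tile of C. */
    __kernel void zorder_demo(__global uint *tile_order, uint tiles_m, uint tiles_n)
    {
        uint g        = get_group_id(0);             /* 1D launch over all tiles */
        uint tile_row = compact_even_bits(g);        /* bits 0,2,4,... of g      */
        uint tile_col = compact_even_bits(g >> 1);   /* bits 1,3,5,... of g      */

        /* Padding groups outside the real tile grid exit right away. */
        if (tile_row >= tiles_m || tile_col >= tiles_n)
            return;

        if (get_local_id(0) == 0)
            tile_order[tile_row * tiles_n + tile_col] = g;
    }

The idea is that consecutive work-groups then touch tiles that are close in both rows and columns, so they share rows of A and columns of B instead of all streaming down the same power-of-two-strided rows at once.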

To handle the cases M != N or M != 1<<n, I just increased the number of work-groups along M/N to the nearest 1<<n; groups that have no work exit at the beginning, which does not add much overhead. A host-side sketch of the rounding is below.
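
Host-side, the rounding can look like this (a sketch with made-up names, not the actual code): the tile grid is rounded up to the next power of two so the z-curve covers a square grid, and the surplus groups are the ones that return immediately in the kernel.

    #include <stddef.h>

    static unsigned next_pow2(unsigned v)
    {
        unsigned p = 1;
        while (p < v)
            p <<= 1;
        return p;
    }

    /* Number of work-groups for the 1D z-order launch.
       Example with 128x128 tiles: M = N = 4096 -> 32x32 tiles, side = 32,
       1024 groups, none idle; M = 4096, N = 4097 -> 32x33 tiles, side = 64,
       4096 groups, of which 32 * 33 = 1056 do real work. */
    static size_t zorder_group_count(unsigned M, unsigned N, unsigned tile)
    {
        unsigned tiles_m = (M + tile - 1) / tile;
        unsigned tiles_n = (N + tile - 1) / tile;
        unsigned side    = next_pow2(tiles_m > tiles_n ? tiles_m : tiles_n);
        return (size_t)side * side;
    }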

The z-order remapping improved performance by about 4x.